Part I · Foundations Week 1 Published Notebook

Linear ODEs as state-space systems

Continuous-time linear systems, the matrix exponential as fundamental solution, the damped oscillator as canonical example, and the eigenvalue classification of phase portraits.

Linear ODEs as state-space systems

Chapter 1 — at a glance

Goal: introduce the continuous-time linear system $\ddt \statevec(t) = \statemat \statevec(t) + \inputmat u(t)$ as the mathematical object every later chapter discretizes, analyzes, or generalizes. By the end you can write a state-space model from an ODE, compute its fundamental solution, classify its qualitative behavior from $\statemat$ ‘s eigenvalues, and read a phase portrait.

Reading time: ~30 minutes prose; 60–90 minutes with the companions and exercises.

Key insight: every structured state-space model in this book — S4, S4D, Mamba, the delta-rule lineage, hybrids — is a discrete realization of a continuous linear system. Understanding the continuous object first means you can switch between the recurrent, convolutional, and frequency-domain views without re-deriving each one.

1.1 What is a state-space system?

A linear state-space system is the pair of equations

\ddt \statevec(t) = \statemat \statevec(t) + \inputmat u(t), \qquad y(t) = \outputmat \statevec(t) + \feedmat u(t),

where $\statevec(t) \in \R^N$ is the state at time $t \ge 0$ , $u(t) \in \R^D$ is the input signal, $y(t) \in \R^M$ is the output, and the four matrices $(\statemat, \inputmat, \outputmat, \feedmat)$ are real-valued with shapes $N \times N$ , $N \times D$ , $M \times N$ , $M \times D$ . The first equation is the state equation — a first-order linear ODE describing how the state evolves. The second is the output equation — a static linear readout. The integer $N$ is the state dimension; it controls how much history the model can carry forward.

The pair is “linear” because both equations are linear in $\statevec$ and $u$ separately. It is “state-space” because $\statevec(t)$ summarizes everything the system needs from the past in order to predict the future given future inputs — a property called the Markov property of state. Given the present state and the input from now onward, the system’s entire future is determined.

A few conventions before going further. We treat $\statevec(t)$ as a column vector and apply $\statemat$ on the left: $\statemat \statevec$ , not $\statevec \statemat$ . We absorb the feedthrough term $\feedmat u(t)$ into the output equation only — it appears nowhere in the state dynamics — and we will often set $\feedmat = 0$ in examples to keep notation light. When the input is absent ( $u \equiv 0$ ), the state equation reduces to

\ddt \statevec(t) = \statemat \statevec(t),

the homogeneous linear ODE, whose solution is the focus of §1.2.

Why this object

Three reasons the state-space form is the right starting point for a book on sequence-model architectures:

It separates dynamics from input. Whatever the input signal looks like, the dynamics matrix $\statemat$ alone determines the system’s long-run behavior — its stability, oscillation frequencies, and decay rates. Most of the analysis in Chapter 2 looks only at $\statemat$ .
It generalizes immediately. Replacing the scalar derivative $\ddt$ with a finite difference yields a discrete recurrence (Chapter 4); replacing $\R$ with $\C$ admits oscillatory modes (Chapter 6, Chapter 10); replacing the constant $\statemat$ with an input-dependent $\statemat(u_t)$ yields the selective-SSM family (Chapter 9). Each generalization preserves the state-space skeleton.
It’s the form in which continuous-time models are discretized. Every SSM derivation in the literature starts from a continuous state equation and applies a numerical integration scheme to get a discrete recurrence. If you don’t know the continuous object, the discrete one is opaque.

1.2 The matrix exponential and fundamental solutions

The homogeneous ODE $\ddt \statevec = \statemat \statevec$ with initial condition $\statevec(0) = \statevec_0$ has a unique solution for every $\statevec_0 \in \R^N$ — and that solution is

\statevec(t) = e^{\statemat t} \statevec_0,

where $e^{\statemat t}$ is the matrix exponential of $\statemat t$ , defined by the same series as the scalar exponential:

Definition 1.1.

For any square matrix $M \in \R^{N \times N}$ , the matrix exponential is

e^M := \sum_{k=0}^{\infty} \frac{M^k}{k!} = I + M + \tfrac{1}{2} M^2 + \tfrac{1}{6} M^3 + \cdots

The series converges absolutely for every $M$ (the entrywise $\ell^1$ -norm of the partial sums is bounded by $e^{\norm{M}}$ ), so $e^M$ is always well-defined.

The fundamental solution property — that $t \mapsto e^{\statemat t} \statevec_0$ solves the homogeneous ODE — is verified by differentiating the series term-by-term and using $\frac{d}{dt} \frac{(\statemat t)^k}{k!} = \frac{\statemat^k t^{k-1}}{(k-1)!}$ . The resulting derivative is $\statemat \cdot e^{\statemat t} \statevec_0$ , which matches $\statemat \statevec(t)$ . Existence and uniqueness for the inhomogeneous case ( $u \not\equiv 0$ ) follow the variation of parameters formula

\statevec(t) = e^{\statemat t} \statevec_0 + \int_0^t e^{\statemat (t-s)} \inputmat u(s) \, ds,

an integral that recurs (in discretized form) as the convolution kernel of every LTI SSM in Chapter 8.

Eigenvalue structure determines everything

The matrix exponential’s behavior is governed entirely by the eigenvalues of $\statemat$ . If $\statemat$ is diagonalizable, $\statemat = V \Lambda V^{-1}$ with $\Lambda = \diag(\lambda_1, \ldots, \lambda_N)$ , then

e^{\statemat t} = V \, e^{\Lambda t} \, V^{-1} = V \, \diag(e^{\lambda_1 t}, \ldots, e^{\lambda_N t}) \, V^{-1}.

So $e^{\statemat t}$ acts on the eigenbasis as independent scalar exponentials. The real parts of $\lambda_i$ control decay or growth: $\Re(\lambda_i) < 0$ means $e^{\lambda_i t} \to 0$ ; $\Re(\lambda_i) > 0$ means blow-up. The imaginary parts control oscillation frequency. Three regimes recur throughout the book:

All eigenvalues with negative real part — the system is asymptotically stable: every trajectory $\statevec(t) \to 0$ as $t \to \infty$ . This is the regime in which a recurrence “forgets” its initial state, and it is the design target for the $\statemat$ matrices of S4, Mamba, and friends.
At least one eigenvalue with positive real part — the system is unstable: some initial conditions blow up. A trained SSM whose $\statemat$ drifts into this regime is the dominant numerical failure mode (Chapter 5).
Purely imaginary eigenvalues — provided they are non-defective, the system is marginally stable and oscillates without decay; a defective imaginary eigenvalue instead picks up polynomial-in- $t$ growth and is unstable (Chapter 2 makes this precise). The non-defective modes are the energy-preserving modes of Hamiltonian systems and the natural home of symplectic discretization (Chapter 6).

When $\statemat$ is not diagonalizable — when it has repeated eigenvalues with deficient eigenspaces — the matrix exponential picks up polynomial-in- $t$ factors on top of the $e^{\lambda_i t}$ terms. The Jordan normal form (Chapter 3, §3.1) gives the precise statement. For generic dense random $\statemat$ , non-diagonalizability is a measure-zero, codimension-one event. But it is not merely pathological: it appears by design in structured systems — the critically-damped oscillator of §1.3 (a repeated real eigenvalue), and the structured/learned $\statemat$ of HiPPO-LegS and S4 (Chapter 3). Where it occurs, those polynomial-in- $t$ factors are the whole story, not an edge case.

1.3 The damped harmonic oscillator

The canonical small example. A unit mass on a spring with stiffness $k > 0$ and damping coefficient $c > 0$ , displacement $q(t)$ , evolves according to Newton’s second law,

\ddot q(t) + c \, \dot q(t) + k \, q(t) = 0.

This is a second-order scalar ODE, not first-order, so it does not yet fit the state-space template. We lift it by treating velocity as an extra state coordinate: let $\statevec(t) = (q(t), \dot q(t))^\top \in \R^2$ . Then $\ddt \statevec = \statemat \statevec$ with

\statemat = \begin{pmatrix} 0 & 1 \\ -k & -c \end{pmatrix}.

This lifting trick — introduce auxiliary states to reduce ODE order to one — is the same procedure by which every higher-order ODE becomes a first-order state-space system, and it foreshadows how implicit higher-order schemes (Chapter 6) work internally.

The eigenvalues of $\statemat$ are the roots of the characteristic polynomial $\lambda^2 + c \lambda + k = 0$ :

\lambda_\pm = -\frac{c}{2} \pm \frac{1}{2}\sqrt{c^2 - 4k}.

Three sub-cases follow from the sign of the discriminant $c^2 - 4k$ :

Overdamped ( $c^2 > 4k$ ): two real negative eigenvalues. Both modes decay monotonically; the displacement approaches zero without oscillation.
Critically damped ( $c^2 = 4k$ ): a repeated negative eigenvalue $\lambda = -c/2$ . The matrix $\statemat$ is in this case not diagonalizable, and the solution contains a $t \cdot e^{\lambda t}$ term.
Underdamped ( $c^2 < 4k$ ): a complex-conjugate eigenpair $\lambda_\pm = -c/2 \pm i\omega$ , with $\omega = \tfrac{1}{2}\sqrt{4k - c^2}$ . Trajectories spiral toward the origin: oscillation at angular frequency $\omega$ with envelope decaying as $e^{-ct/2}$ .

The total energy $E(t) = \tfrac{1}{2}(\dot q^2 + k q^2)$ is a useful diagnostic. Its time derivative is $\dot E = -c \dot q^2 \le 0$ , i.e. energy decreases monotonically whenever $\dot q \ne 0$ . Both stationary and oscillating trajectories satisfy this. The companion damped_oscillator.py simulates the underdamped regime and plots $E(t)$ alongside the trajectory; the energy curve gives an at-a-glance check that the numerical integration preserves the dissipation structure of the continuous system.

Energy decay of the damped harmonic oscillator over time, plotted on linear and log scales. — Energy E(t) of the underdamped oscillator (k=4, c=0.2) decaying exponentially in time. Left: linear scale, showing the envelope shape. Right: log scale, where the constant slope confirms exponential energy decay at rate c — twice the c/2 rate of the amplitude envelope, since energy scales as amplitude squared. Produced by companions/ch01/jax/damped_oscillator.py.

1.4 Coupled systems and the Jacobian

The damped oscillator is a 2-dimensional toy. Real systems — and trained SSMs — live in much higher dimensions, and they typically arise from coupling many simpler subsystems. The bookkeeping is straightforward once you accept that the state vector $\statevec$ stacks all subsystem states and $\statemat$ encodes both intra-subsystem dynamics and inter-subsystem coupling.

A useful larger example is a ring of coupled oscillators: $n$ identical damped oscillators arranged on a circle, each coupled to its two nearest neighbors by springs of stiffness $\kappa$ . The displacements $q_1, \ldots, q_n$ and velocities $\dot q_1, \ldots, \dot q_n$ stack into a state vector $\statevec \in \R^{2n}$ . The $\statemat$ matrix has a $2 \times 2$ block-diagonal structure for the per-oscillator dynamics, plus an off-diagonal coupling pattern that links each oscillator’s velocity equation to its neighbors’ positions:

\ddot q_i = -k q_i - c \dot q_i + \kappa (q_{i-1} - 2 q_i + q_{i+1}), \qquad i = 1, \ldots, n,

with indices taken mod $n$ . The bracketed term $q_{i-1} - 2 q_i + q_{i+1}$ is the discrete Laplacian of $q$ around the ring; it’s the same discretization of $\partial^2/\partial x^2$ you would use in a finite-difference method for the wave equation.

The $2n \times 2n$ Jacobian matrix $\jacobian$ — which for a linear system is just $\statemat$ — has eigenvalues that are easy to compute analytically because of the ring’s circulant structure. They come in complex-conjugate pairs

\lambda_\pm^{(m)} = -\frac{c}{2} \pm i \sqrt{k + 2\kappa(1 - \cos(2\pi m/n)) - c^2/4}, \qquad m = 0, 1, \ldots, n-1,

(for the underdamped regime). Each $m$ indexes a standing wave mode on the ring with wavenumber $2\pi m / n$ ; the $m=0$ mode is uniform translation (no spatial structure), $m = n/2$ (when $n$ is even) is the maximally-oscillating mode where adjacent oscillators move in antiphase. The companion coupled_oscillators.py constructs $\statemat$ for $n = 8$ and visualizes the eigenvalues in the complex plane.

Eigenvalues of the coupled-oscillator Jacobian in the complex plane, showing two arcs corresponding to the underdamped modes. — Eigenvalues of the ring-of-oscillators Jacobian for n=8, k=4, c=0.2, κ=1. All 16 eigenvalues sit in the left half-plane (asymptotically stable) and arrange into two arcs corresponding to the two underdamped families. The arc curvature reflects the dispersion relation ω(m) of the spatial Laplacian. Produced by companions/ch01/jax/coupled_oscillators.py.

The key qualitative lesson: once you have the right state-space lift, even a system with rich spatial structure reduces to “compute eigenvalues of $\statemat$ , classify by real and imaginary parts.” The eigenvalue distribution of a trained SSM’s $\statemat$ — pre-training, mid-training, and converged — is one of the most informative diagnostics you can compute (Chapter 2, §2.2; Chapter 5).

The Jacobian beyond linear systems

For a nonlinear system $\ddt \statevec = f(\statevec)$ , the Jacobian $\jacobian(\statevec) = \partial f / \partial \statevec$ is the matrix of partial derivatives at $\statevec$ . The linearization $\ddt \delta\statevec = \jacobian(\statevec^*) \delta\statevec$ around a fixed point $\statevec^*$ (where $f(\statevec^*) = 0$ ) describes how small perturbations $\delta\statevec$ evolve. The eigenvalues of $\jacobian(\statevec^*)$ then determine local stability of $\statevec^*$ by the same classification as in §1.2.

Trained recurrent networks — including SSMs in inference mode — are formally nonlinear (the state update is state = f(state, input) where f includes the input-dependent matrices). The linearization-around-a-state gives a local state-space system at each time step, and tracking how that local $\statemat$ varies along a trajectory is the foundation of the Lyapunov-exponent analysis in Chapter 2.

1.5 Phase portraits and the eigenvalue classification

A phase portrait is the geometric picture of all trajectories of an autonomous system in state space, drawn as oriented curves. For 2-dimensional linear systems $\ddt \statevec = \statemat \statevec$ on $\R^2$ , the topology of the portrait is determined entirely by the trace $\tr(\statemat)$ and the determinant $\det(\statemat)$ — equivalently, by the eigenvalues — and the standard classification gives six qualitative behaviors:

| Region in $(\tr, \det)$ -plane | Eigenvalues | Phase portrait | |---|---|---| | $\det > 0$ , $\tr^2 - 4\det > 0$ , $\tr < 0$ | Real negative | Stable node | | $\det > 0$ , $\tr^2 - 4\det > 0$ , $\tr > 0$ | Real positive | Unstable node | | $\det > 0$ , $\tr^2 - 4\det < 0$ , $\tr < 0$ | Complex conjugate, $\Re < 0$ | Stable spiral | | $\det > 0$ , $\tr^2 - 4\det < 0$ , $\tr > 0$ | Complex conjugate, $\Re > 0$ | Unstable spiral | | $\det > 0$ , $\tr = 0$ | Pure imaginary | Center (closed orbits) | | $\det < 0$ | Real, opposite signs | Saddle |

The damped oscillator’s overdamped and underdamped sub-cases realize two of these — stable node and stable spiral — while the critically-damped case sits on the boundary $\tr^2 = 4\det$ as a stable degenerate node, a borderline behavior the six-way classification above does not tabulate. The undamped limit $c \to 0$ slides the eigenvalues onto the imaginary axis, turning the spiral into a center; this is the Hamiltonian limit where energy is conserved exactly.

For higher-dimensional systems the topology is richer (think: 3-D systems can have spirals around line-shaped invariant manifolds), but the building blocks are the same: each pair of complex-conjugate eigenvalues contributes a 2-D spiral subspace, each real eigenvalue contributes a 1-D direction of decay or growth. This decomposition by invariant subspaces — equivalently, by the Jordan blocks of $\statemat$ — is the geometric content of the matrix exponential’s eigenvalue structure (§1.2).

1.6 A preview of frequency response

The variation-of-parameters formula

\statevec(t) = e^{\statemat t} \statevec_0 + \int_0^t e^{\statemat (t-s)} \inputmat u(s) \, ds

expresses the state as a convolution of the input with the kernel $K(t) := e^{\statemat t} \inputmat$ . The output $y(t) = \outputmat \statevec(t)$ (taking $\feedmat = 0$ ) is therefore

y(t) = \outputmat e^{\statemat t} \statevec_0 + \int_0^t \outputmat e^{\statemat (t-s)} \inputmat \, u(s) \, ds.

The function $h(t) := \outputmat e^{\statemat t} \inputmat$ for $t \ge 0$ (and zero otherwise) is the impulse response of the system: the output you would observe if the input were a Dirac delta $u(t) = \delta(t)$ .

Taking the Laplace transform turns convolution into multiplication. With $\widehat y(s) := \int_0^\infty y(t) e^{-st} dt$ and similarly for $\widehat u$ , you get

\widehat y(s) = H(s) \, \widehat u(s) + \text{(initial-condition terms)},

where $H(s) := \outputmat (s I - \statemat)^{-1} \inputmat$ is the transfer function. The eigenvalues of $\statemat$ — already shown to govern $e^{\statemat t}$ — appear as the poles of $H(s)$ , since $(sI - \statemat)^{-1}$ blows up exactly when $s$ equals an eigenvalue of $\statemat$ . The poles’ location in the complex $s$ -plane mirrors the eigenvalue classification of §1.5: left-half-plane poles ⇔ stable system; right-half-plane poles ⇔ unstable; imaginary-axis poles ⇔ marginally stable / oscillatory.

This connection is the bridge to Chapter 8, where you’ll see that the S4 layer’s convolutional view is literally a numerical evaluation of the impulse response $h(t)$ at uniformly spaced time points, and to Chapter 11, where the spectral structure of long convolution kernels (Hyena, RetNet) is analyzed via the same transfer-function framing.

1.7 What’s next

The next two chapters develop the analytical machinery that this chapter has only sketched. Chapter 2 makes the eigenvalue-stability story rigorous (Lyapunov, A-stability, BIBO) and adds the QR-based computational method for tracking eigenvalue trajectories during training. Chapter 3 introduces the structured-matrix vocabulary (Toeplitz, semiseparable, Vandermonde, Cauchy) needed to discuss the kernel constructions of S4 and friends. Chapters 4–6 then take the continuous state-space system here and discretize it — first crudely (ZOH, bilinear), then with full numerical-analysis machinery (Butcher tableau, A-stability), then via implicit and structure-preserving methods (Gauss–Legendre, symplectic integrators), which is where the active research pilot’s anchor (the C1 pilot) lives.

You can also profitably jump ahead: Chapter 8’s LTI SSM presentation reuses everything in this chapter and is self-contained for readers who want to see the SSM application before working through the full math foundation.

1.8 Exercises

Seven problems mixing computation and theory. Short/numerical exercises (1–4) have inline collapsible solutions; long/proof exercises (5–7) have full worked solutions in §1.9.

Exercise 1.1 (computation)

Compute the matrix exponential of $\statemat = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$ by hand, term-by-term, and verify it equals the rotation matrix $R(t) = \begin{pmatrix} \cos t & \sin t \\ -\sin t & \cos t \end{pmatrix}$ .

Solution

Compute powers: $\statemat^2 = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix} = -I$ , so $\statemat^3 = -\statemat$ , $\statemat^4 = I$ , and the pattern repeats with period 4. Splitting the series by parity:

e^{\statemat t} = \sum_{k} \frac{(\statemat t)^k}{k!} = I \sum_{k} \frac{(-1)^k t^{2k}}{(2k)!} + \statemat \sum_k \frac{(-1)^k t^{2k+1}}{(2k+1)!} = I \cos t + \statemat \sin t,

which expands to $\begin{pmatrix} \cos t & \sin t \\ -\sin t & \cos t \end{pmatrix} = R(t)$ . ∎

Exercise 1.2 (computation)

For the damped oscillator $\ddot q + c\dot q + k q = 0$ with $k = 1$ and $c = 0.5$ , classify the damping regime and find the eigenvalues of $\statemat$ .

Solution

Discriminant: $c^2 - 4k = 0.25 - 4 = -3.75 < 0$ , so the system is underdamped. Eigenvalues: $\lambda_\pm = -0.25 \pm i \tfrac{1}{2}\sqrt{3.75} \approx -0.25 \pm 0.968 i$ . The trajectory spirals toward the origin with envelope decay rate $0.25$ and angular frequency $\approx 0.968$ .

Exercise 1.3 (computation)

Lift the third-order ODE $q''' + 2 q'' + q = u(t)$ into a 3-dimensional state-space system $\ddt \statevec = \statemat \statevec + \inputmat u$ . Write $\statemat$ and $\inputmat$ explicitly.

Solution

Let $\statevec = (q, q', q'')^\top$ . Then $\dot \statevec = (q', q'', q''')^\top$ , and from the ODE $q''' = -2 q'' - q + u$ , so

\statemat = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ -1 & 0 & -2 \end{pmatrix}, \qquad \inputmat = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.

The bottom row encodes the ODE’s coefficient signature; the upper-triangular ones implement the bookkeeping $q' \mapsto q', q'' \mapsto q''$ . This is the companion form of the ODE.

Exercise 1.4 (computation)

For the ring of $n = 4$ identical damped oscillators with $k = 1$ , $c = 0$ , $\kappa = 0.5$ (no damping, symmetric coupling), use the formula in §1.4 to find the eigenvalues of $\statemat$ analytically. Confirm they lie on the imaginary axis (since $c = 0$ ) and verify by running companions/ch01/jax/coupled_oscillators.py with $n = 4$ .

Solution

With $c = 0$ , $\lambda_\pm^{(m)} = \pm i\sqrt{k + 2\kappa(1 - \cos(2\pi m/n))}$ for $m = 0, 1, 2, 3$ . Numerically:

$m = 0$ : $\lambda_\pm = \pm i\sqrt{1} = \pm i$
$m = 1$ : $\lambda_\pm = \pm i\sqrt{1 + 2 \cdot 0.5 \cdot 1} = \pm i\sqrt{2}$
$m = 2$ : $\lambda_\pm = \pm i\sqrt{1 + 2 \cdot 0.5 \cdot 2} = \pm i\sqrt{3}$
$m = 3$ : same as $m = 1$ by symmetry ⇒ $\pm i\sqrt{2}$

All 8 eigenvalues are pure imaginary (centers), as expected for an undamped conservative system. The numerical verification reproduces these (modulo double-precision noise).

Exercise 1.5 (theory) — solution in §1.9

Prove that the matrix-exponential series $\sum_k M^k/k!$ converges absolutely (entrywise) for every square matrix $M$ , with bound $\norm{e^M}_F \le e^{\norm{M}_F}$ where $\norm{\cdot}_F$ is the Frobenius norm.

Exercise 1.6 (theory) — solution in §1.9

Show that if $\statemat$ is skew-symmetric ( $\statemat^\top = -\statemat$ ), then $e^{\statemat t}$ is orthogonal for every $t \in \R$ (that is, $(e^{\statemat t})^\top e^{\statemat t} = I$ ). Use this to give a geometric reason why the rotation matrix of Exercise 1.1 is orthogonal.

Exercise 1.7 (theory) — solution in §1.9

Prove the variation of parameters formula: the unique solution of $\ddt \statevec = \statemat \statevec + \inputmat u(t)$ with $\statevec(0) = \statevec_0$ is

\statevec(t) = e^{\statemat t} \statevec_0 + \int_0^t e^{\statemat (t-s)} \inputmat u(s) \, ds.

1.9 Full solutions to theory exercises

Solution to Exercise 1.5

Let $\norm{\cdot}_F$ denote the Frobenius norm, which satisfies the sub-multiplicative property $\norm{AB}_F \le \norm{A}_F \norm{B}_F$ . Then $\norm{M^k}_F \le \norm{M}_F^k$ by induction. Therefore

\left\| \sum_{k=0}^{K} \frac{M^k}{k!} \right\|_F \le \sum_{k=0}^K \frac{\norm{M^k}_F}{k!} \le \sum_{k=0}^K \frac{\norm{M}_F^k}{k!} \to e^{\norm{M}_F} \quad \text{as } K \to \infty.

The partial sums are entrywise bounded by terms of a convergent scalar series, so each matrix entry is Cauchy in $K$ and converges. The limit is what we call $e^M$ . The bound $\norm{e^M}_F \le e^{\norm{M}_F}$ follows by passing to the limit. ∎

Solution to Exercise 1.6

We have $\ddt e^{\statemat t} = \statemat e^{\statemat t}$ and $\ddt (e^{\statemat t})^\top = (e^{\statemat t})^\top \statemat^\top$ . Let $Q(t) := (e^{\statemat t})^\top e^{\statemat t}$ . Then

\ddt Q = (e^{\statemat t})^\top \statemat^\top e^{\statemat t} + (e^{\statemat t})^\top \statemat e^{\statemat t} = (e^{\statemat t})^\top (\statemat^\top + \statemat) e^{\statemat t} = 0,

since $\statemat^\top + \statemat = 0$ by skew-symmetry. So $Q(t) = Q(0) = I$ for all $t$ , i.e. $e^{\statemat t}$ is orthogonal.

Geometric interpretation: the rotation matrix $R(t)$ of Exercise 1.1 was generated by $\statemat = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$ , which is skew-symmetric. So $R(t)$ is orthogonal — which we already knew, but this derivation pins down why: orthogonality follows structurally from skew-symmetry of the generator, not from any property of the cos/sin functions. ∎

Solution to Exercise 1.7

Define $\statevec(t)$ by the formula on the right-hand side. We verify (a) the initial condition and (b) the ODE.

Initial condition. At $t = 0$ the integral vanishes (zero-length interval) and $e^{\statemat \cdot 0} = I$ , so $\statevec(0) = \statevec_0$ . ✓

ODE. Differentiate using the product rule on the integral (Leibniz’s rule for differentiating under the integral sign):

\begin{aligned} \ddt \statevec(t) &= \ddt\left[ e^{\statemat t} \statevec_0 \right] + \ddt \int_0^t e^{\statemat (t-s)} \inputmat u(s) ds \\ &= \statemat e^{\statemat t} \statevec_0 + \left[ e^{\statemat(t-t)} \inputmat u(t) + \int_0^t \frac{\partial}{\partial t} e^{\statemat(t-s)} \inputmat u(s) ds \right] \\ &= \statemat e^{\statemat t} \statevec_0 + \inputmat u(t) + \statemat \int_0^t e^{\statemat(t-s)} \inputmat u(s) ds \\ &= \statemat \left( e^{\statemat t} \statevec_0 + \int_0^t e^{\statemat(t-s)} \inputmat u(s) ds \right) + \inputmat u(t) \\ &= \statemat \statevec(t) + \inputmat u(t). \quad \checkmark \end{aligned}

Uniqueness follows from the classical Picard–Lindelöf theorem applied to the right-hand side $f(\statevec, t) = \statemat \statevec + \inputmat u(t)$ , which is Lipschitz in $\statevec$ uniformly in $t$ with constant $\norm{\statemat}$ (see Hairer–Nørsett–Wanner Hairer et al. (1993) for the standard proof). ∎

1.10 Companion code

Three JAX companions and one PyTorch companion for Chapter 1.

JAX (companions/ch01/jax/):

damped_oscillator.py — simulates the underdamped oscillator, plots energy decay (Figure 1.1)
coupled_oscillators.py — constructs the ring-Laplacian state matrix, plots eigenvalue spectrum (Figure 1.2)
matrix_exponential.py — compares scipy.linalg.expm against truncated series sums; the truncated series catastrophically diverges for matrices with large spectral radius (auxiliary, used by Exercise 1.5)

PyTorch (companions/ch01/torch/):

matrix_exponential.py — the truncated-series-vs-torch.linalg.matrix_exp comparison, making the idiomatic-JAX vs idiomatic-PyTorch contrast concrete (compute-and-parity only; the JAX companions produce the figures).
tests/ — cross-framework parity: the torch partial sums and convergence curves match their JAX counterparts.

JAX companions import their plotting style from companions/_shared/plot_utils.py. To run from the repo root:

PYTHONPATH=. python companions/ch01/jax/damped_oscillator.py
PYTHONPATH=. python companions/ch01/jax/coupled_oscillators.py
PYTHONPATH=. python companions/ch01/jax/matrix_exponential.py
PYTHONPATH=. python companions/ch01/torch/matrix_exponential.py

Figures are written to public/figures/ch01/ and referenced from the <Figure> components above.

Part I · Foundations Week 2 Published

Stability theory: Lyapunov, A-stability, BIBO

Three distinct notions of stability — Lyapunov (eigenvalue criterion + QR-based computation), A-stability (regions of absolute stability for ODE integrators), and BIBO (bounded-input bounded-output for LTI systems).

Stability theory: Lyapunov, A-stability, BIBO

Chapter 2 — at a glance

Goal: develop three notions of stability that the rest of the curriculum treats as distinct objects. Lyapunov stability characterizes asymptotic behavior of trajectories in the continuous system. A-stability characterizes which step sizes a numerical integrator can take without amplifying errors. BIBO stability characterizes the input–output mapping under bounded forcing. The three coincide in special cases (LTI + LHP eigenvalues) but diverge in the general case; conflating them is one of the more common errors in the SSM literature.

Reading time: ~35 minutes prose; 90 minutes with the QR-Lyapunov companion.

Why three? Each captures a different failure mode. Lyapunov tells you the continuous-time model is asymptotically stable; A-stability tells you your discretization of it is. A continuous LTI system can be Lyapunov-stable while your Euler discretization of it explodes — and the gap between the two is exactly what numerical analysis exists to characterize.

2.1 Three notions of stability

For the rest of this chapter, the system under analysis is the LTI state-space model from Chapter 1:

\ddt \statevec(t) = \statemat \statevec(t) + \inputmat u(t), \qquad y(t) = \outputmat \statevec(t),

with $\statemat \in \R^{N \times N}$ . The three notions:

Lyapunov stability asks: with $u \equiv 0$ , do trajectories $\statevec(t) = e^{\statemat t} \statevec_0$ stay bounded as $t \to \infty$ ? Asymptotically stable if $\statevec(t) \to 0$ .
A-stability is a property of a numerical integrator, not of the system. It asks: does the integrator preserve Lyapunov stability for every stable LTI test problem, at every step size $\stepsize > 0$ ? If yes, the method is A-stable.
BIBO stability asks: for the input–output mapping $u \mapsto y$ , does every bounded input ( $\sup_t \abs{u(t)} < \infty$ ) produce a bounded output? The system is BIBO-stable if yes.

When $\statemat$ has all eigenvalues in the open left half-plane (LHP), all three coincide for the LTI case. But:

A system can be Lyapunov-stable but not asymptotically stable (centers: pure imaginary eigenvalues, oscillations don’t decay).
A-stability is an integrator property; the explicit Euler method is not A-stable, while the implicit Euler method is. The same continuous system therefore may need a stable integrator to preserve its stability.
BIBO depends on the transfer function $H(s) = \outputmat (sI - \statemat)^{-1} \inputmat$ and on the choice of $(\statemat, \inputmat, \outputmat)$ realization. Two systems with the same $\statemat$ can have different BIBO properties if their $\outputmat$ “hides” unstable modes.

The rest of the chapter develops each notion in turn.

2.2 Lyapunov stability: theory

We restrict to the homogeneous case $\ddt \statevec = \statemat \statevec$ (zero input). The eigenvalue criterion is essentially the content of Chapter 1, §1.2, formalized:

Theorem 2.1.

Let $\statemat \in \R^{N \times N}$ with eigenvalues $\lambda_1, \ldots, \lambda_N$ . The system $\ddt \statevec = \statemat \statevec$ is:

Lyapunov-stable (trajectories bounded) iff every $\lambda_i$ satisfies $\Re(\lambda_i) \le 0$ , and for every $\lambda_i$ on the imaginary axis ( $\Re(\lambda_i) = 0$ ), its algebraic multiplicity equals its geometric multiplicity (no defective Jordan blocks).
Asymptotically stable (trajectories $\to 0$ ) iff every $\lambda_i$ satisfies $\Re(\lambda_i) < 0$ , i.e. all eigenvalues are in the open left half-plane.
Unstable otherwise.

The proof reduces to the matrix-exponential formula. If $\statemat = V \Lambda V^{-1}$ is diagonalizable, then $\statevec(t) = V e^{\Lambda t} V^{-1} \statevec_0$ and the $i$ -th eigencomponent decays like $e^{\Re(\lambda_i) t}$ . The Jordan-block caveat in case (1) handles the case where a defective imaginary eigenvalue produces polynomial growth in $t$ even though the real part is zero.

For nonlinear systems, the local-linearization picture from Chapter 1, §1.4 makes the same theorem applicable at each fixed point: the fixed point is asymptotically stable if the Jacobian’s eigenvalues all sit in the LHP. This is the principle that makes “check the eigenvalues” the first move in nearly every SSM stability analysis.

The Lyapunov equation

There’s an alternative characterization of asymptotic stability that doesn’t require computing eigenvalues. The continuous-time Lyapunov equation

\statemat^\top P + P \statemat = -Q

(with $Q$ a given symmetric positive-definite matrix) has a unique symmetric positive-definite solution $P$ if and only if $\statemat$ is asymptotically stable. The function $V(\statevec) = \statevec^\top P \statevec$ is then a Lyapunov function for the system: $\dot V = -\statevec^\top Q \statevec < 0$ for all nonzero $\statevec$ , so $V$ decreases along trajectories and forces them to the origin.

For SSM analysis, the Lyapunov-equation view is occasionally useful — for example, the controllability and observability Gramians of an LTI realization are Lyapunov solutions Antoulas (2005) — but the eigenvalue criterion is the more direct tool.

2.3 Lyapunov exponents: computation

The eigenvalues of $\statemat$ work as a stability diagnostic for LTI systems, where $\statemat$ is constant. For systems where the linearization $\jacobian(t)$ varies with time (e.g. a trained recurrent network in inference mode, where the per-step Jacobian depends on the input), eigenvalues at any single $t$ don’t tell the long-run story. The right tool is the Lyapunov exponent, which generalizes the eigenvalue’s real part to time-varying systems.

Definition 2.2.

Given a sequence of transition matrices $\jacobian_1, \jacobian_2, \ldots$ (the Jacobians of an iterated map at points along a trajectory), let $\Phi_T := \jacobian_T \jacobian_{T-1} \cdots \jacobian_1$ . The $i$ -th Lyapunov exponent is

\lyapexp_i := \lim_{T \to \infty} \frac{1}{T} \log \sigma_i(\Phi_T),

where $\sigma_1(\Phi_T) \ge \sigma_2(\Phi_T) \ge \cdots$ are the singular values of $\Phi_T$ . The vector $(\lyapexp_1, \ldots, \lyapexp_N)$ is the Lyapunov spectrum.

Three things make Lyapunov exponents the right object:

They generalize eigenvalues to time-varying systems. For a constant Jacobian $\jacobian_t \equiv \statemat$ , $\lyapexp_i = \log\abs{\mu_i(\statemat)}$ where $\mu_i$ are the eigenvalues’ magnitudes (in the discrete case) or $\Re(\lambda_i)$ (in the continuous case). The connection back to Chapter 1’s eigenvalue classification is exact in the autonomous case.
The maximal Lyapunov exponent $\lyapexp_1$ measures sensitive dependence. A positive $\lyapexp_1$ means infinitesimally close initial conditions diverge exponentially (chaos); $\lyapexp_1 = 0$ is the edge-of-chaos regime; $\lyapexp_1 < 0$ means contracting dynamics.
They are robust to coordinate changes. Eigenvalues of the per-step Jacobian depend on the basis you write the state in; Lyapunov exponents do not.

The QR algorithm

Computing $\sigma_i(\Phi_T)$ for large $T$ directly is numerically impossible: even for a well-behaved system with $\lyapexp_1 = 0.1$ , after $T = 100$ steps the largest singular value is $e^{10} \approx 22000$ and the smallest may be $e^{-10} \approx 5 \times 10^{-5}$ . The product matrix $\Phi_T$ becomes either numerical zero or numerical infinity in finite-precision arithmetic, and finite-precision singular values lose all meaning. The fix, due to Benettin et al. Benettin et al. (1980) , is to factor the growth out at every step using QR decomposition.

The QR-based Lyapunov algorithm (the version that has become standard in dynamical-systems software):

Initialize Q_0 = I (identity), running sums L_i = 0 for i = 1..N.
For t = 1, 2, ..., T:
    M_t = J_t @ Q_{t-1}             # one-step forward propagation
    Q_t, R_t = qr(M_t)              # extract orthogonal + upper triangular
    Update L_i += log |R_t[i,i]| for each i
At the end:
    Lyapunov_i = L_i / T

The key insight: by re-orthonormalizing the propagated frame at every step, we factor the exponential growth into the diagonal of the upper-triangular $R_t$ rather than letting it accumulate in $\Phi_T$ . The diagonal entries of $R_t$ are well-conditioned numbers near 1, and their logarithms sum nicely to give the Lyapunov exponents.

The companion lyapunov_qr.py implements this algorithm and applies it to the ring of coupled oscillators from Chapter 1, §1.4. The resulting spectrum’s structure — all exponents negative for the damped case, all near zero for the undamped case — matches the eigenvalue analysis exactly, validating the implementation.

Lyapunov spectrum of the ring-of-oscillators system for damped and undamped regimes. — Lyapunov spectrum (sorted exponents λ_1 ≥ λ_2 ≥ ...) computed via the QR algorithm for the ring of n=8 oscillators. Damped case (c=0.2): all 16 exponents are negative, indicating asymptotically stable dynamics — every direction in state space contracts. Undamped case (c=0): all exponents are zero, indicating marginally stable (energy-preserving) dynamics. The agreement with the eigenvalue real-parts of the autonomous Jacobian validates the implementation. Produced by companions/ch02/jax/lyapunov_qr.py.

2.4 A-stability: integrator regions of absolute stability

Switch perspectives. We now have a continuous system $\ddt \statevec = \statemat \statevec$ and we discretize it with some numerical integrator at step size $\stepsize$ . The integrator approximates the true continuous trajectory by a discrete recurrence

\statevec_{n+1} = R(\stepsize \statemat) \statevec_n,

where $R(z)$ is the integrator’s stability function, evaluated at the matrix argument $\stepsize \statemat$ via the matrix-functional-calculus convention. Different integrators give different $R$ :

Forward (explicit) Euler: $\statevec_{n+1} = \statevec_n + \stepsize \statemat \statevec_n$ , so $R_{\text{FE}}(z) = 1 + z$ .
Backward (implicit) Euler: $\statevec_{n+1} = \statevec_n + \stepsize \statemat \statevec_{n+1}$ , giving $R_{\text{BE}}(z) = 1/(1-z)$ .
Bilinear (Tustin / trapezoidal): $R_{\text{bil}}(z) = (1 + z/2)/(1 - z/2)$ .
Exact (zero-order hold): $R_{\text{ZOH}}(z) = e^z$ . This is the integrator S5 and the Mamba line use by default (the original S4 used bilinear; see Chapter 8).

The discrete recurrence is Lyapunov-stable if and only if every eigenvalue $\mu_i$ of $R(\stepsize \statemat)$ satisfies $\abs{\mu_i} \le 1$ (the discrete-time analog of the LHP criterion is the unit disk). Since the eigenvalues of $R(\stepsize \statemat)$ are $R(\stepsize \lambda_i)$ where $\lambda_i$ are the eigenvalues of $\statemat$ , the question reduces to: for which $z = \stepsize \lambda_i$ in the complex plane is $\abs{R(z)} \le 1$ ?

Definition 2.3.

The region of absolute stability of a numerical integrator with stability function $R(z)$ is

S := \{ z \in \C : \abs{R(z)} \le 1 \}.

The integrator is A-stable if $S$ contains the entire closed left half-plane.

A-stability matters because it means: if the continuous system is asymptotically stable, the integrator preserves that stability at every step size $\stepsize > 0$ . Without A-stability, you have to pick $\stepsize$ small enough that $\stepsize \lambda_i$ lands inside the (bounded) stability region.

The four examples above:

Forward Euler: $\abs{1 + z} \le 1$ ⇔ $z$ in the closed disk of radius 1 around $-1$ . Tiny region; not A-stable. For a continuous system with eigenvalue $\lambda = -100$ (fast decay), forward Euler requires $\stepsize < 2/100$ to avoid blowup.
Backward Euler: $\abs{1/(1-z)} \le 1$ ⇔ $\abs{1-z} \ge 1$ ⇔ $z$ outside the open disk of radius 1 around $1$ . This contains the entire closed LHP. A-stable.
Bilinear (trapezoidal): $\abs{R_{\text{bil}}(z)} \le 1$ ⇔ $\Re(z) \le 0$ exactly. The stability region is exactly the closed LHP. A-stable, and the boundary coincides with the imaginary axis — a property called L-stability when combined with $R(z) \to 0$ as $\Re(z) \to -\infty$ , which bilinear does not satisfy ( $R_{\text{bil}}(-\infty) = -1$ ).
Zero-order hold (matrix exponential): $\abs{e^z} = e^{\Re(z)} \le 1$ ⇔ $\Re(z) \le 0$ . The stability region is the closed LHP. A-stable and L-stable. This is the gold standard for LTI discretization — and it’s the scheme S5, the Mamba line, and this book’s SSM chapters standardize on.

Stability regions in the complex plane for forward Euler, backward Euler, bilinear, and ZOH integrators. — Stability regions S = { z : |R(z)| ≤ 1 } for four common integrators. Forward Euler's tiny disk around -1 makes it conditionally stable only; backward Euler and bilinear/trapezoidal are A-stable; ZOH (the matrix exponential) is A-stable and L-stable, with stability region exactly the closed left half-plane. The eigenvalue distribution of a learned SSM's A matrix overlaid on these regions tells you immediately which integrators are safe to use. Produced by companions/ch02/jax/stability_regions.py.

Why this matters for SSMs

The S4 family discretizes with A-stable schemes throughout — bilinear in the original S4, ZOH in S5 — so the layer is stable at every step size for which the continuous system is stable. Mamba-1 uses ZOH as well. Mamba-3’s exponential-trapezoidal rule (Chapter 10) is a second-order generalization of bilinear that retains A-stability Lahoti et al. (2026) . The C1 research pilot (see Chapter 6) asks whether symplectic integrators — A-stable methods designed to preserve geometric invariants — can do better for complex-state SSMs whose $\statemat$ has near-imaginary eigenvalues.

When a stability-region analysis fails, it usually fails dramatically: a forward-Euler discretization of a stiff system blows up in a handful of steps. This is exactly the failure mode that careful integrator choice prevents.

For full coverage of A-stability theory and a wider zoo of integrator stability functions, Hairer–Wanner’s stiff-systems volume Hairer & Wanner (1996) is the canonical reference.

2.5 BIBO stability

The third notion looks at the input–output mapping rather than the autonomous dynamics. For the system $\ddt \statevec = \statemat \statevec + \inputmat u$ , $y = \outputmat \statevec$ , the output under a unit-impulse input is the impulse response

h(t) = \outputmat e^{\statemat t} \inputmat \quad \text{for } t \ge 0.

The system is BIBO-stable if its impulse response satisfies $\int_0^\infty \norm{h(t)} \, dt < \infty$ , equivalently if $h \in L^1$ . Under this condition, any bounded input $u$ produces a bounded output $y$ — Young’s inequality on convolutions gives the precise bound $\sup_t \norm{y(t)} \le \norm{h}_{L^1} \cdot \sup_t \abs{u(t)}$ .

Theorem 2.4.

A causal LTI system $(\statemat, \inputmat, \outputmat)$ is BIBO-stable iff all poles of its transfer function $H(s) := \outputmat (sI - \statemat)^{-1} \inputmat$ lie in the open left half-plane.

Poles of $H(s)$ are the values of $s$ where $\det(sI - \statemat) = 0$ — i.e. the eigenvalues of $\statemat$ — unless a pole–zero cancellation occurs because $\outputmat$ doesn’t “see” some eigenmode. This cancellation matters: a system can be internally unstable (some $\lambda_i$ in the right half-plane) but BIBO-stable if those unstable modes are unobservable from $\outputmat$ and unreachable from $\inputmat$ . The Kalman canonical-decomposition theorem makes this precise; for our purposes, the eigenvalues of $\statemat$ in the LHP are sufficient for both internal and BIBO stability.

For an SSM viewed as a sequence-to-sequence map, BIBO is the natural input–output stability concept: bounded inputs (typical of token embeddings) should produce bounded outputs (typical of pre-softmax logits). When training an SSM produces an $\statemat$ whose eigenvalues drift into the right half-plane, both internal (Lyapunov) and input–output (BIBO) stability fail simultaneously, and the model’s outputs blow up. Tracking the eigenvalues of $\statemat$ during training is therefore a single diagnostic for both failure modes.

2.6 Connection to Lyapunov functions

The Lyapunov-function method gives a non-spectral way to prove stability. For nonlinear systems where eigenvalues don’t directly apply, finding a function $V: \R^N \to \R$ with $V(0) = 0$ , $V(\statevec) > 0$ for $\statevec \ne 0$ , and $\dot V(\statevec) := \nabla V(\statevec) \cdot f(\statevec) < 0$ for all $\statevec \ne 0$ proves asymptotic stability of $\statevec^* = 0$ .

For our LTI case $f(\statevec) = \statemat \statevec$ , the Lyapunov equation $\statemat^\top P + P \statemat = -Q$ (with $Q$ p.d.) gives $V(\statevec) = \statevec^\top P \statevec$ as a quadratic Lyapunov function — and its existence with $P$ p.d. is equivalent to asymptotic stability (as noted in §2.2). The interest in the function-method picture for our context is mainly conceptual: it explains why “energy” arguments work to prove stability of mechanical systems (the energy is the Lyapunov function), and it generalizes immediately to nonlinear systems where the eigenvalue criterion only gives local information at fixed points.

This connects forward to Chapter 6, where the question of which discretization preserves a Lyapunov function (or, more strongly, an energy invariant) leads naturally into symplectic methods.

2.7 What’s next

Chapter 3 develops the structured-linear-algebra vocabulary (SVD, Jordan form, condition number, Toeplitz/Vandermonde/Cauchy/semiseparable structure) needed for the SSM kernel constructions in Chapters 7–9. Chapter 4 takes the discretization story started in §2.4 and develops it systematically: order conditions, accuracy classes, the Butcher tableau, the bilinear and ZOH derivations in detail. Chapter 6 picks up the A-stability theme and pushes into implicit and structure-preserving integrators — symplectic schemes, geometric integration, and the C1 pilot’s home territory.

2.8 Exercises

Six problems. Inline-collapsible solutions for the shorter ones; full solutions for the longer theory problems in §2.9.

Exercise 2.1 (computation)

Use the eigenvalue criterion (Theorem 2.1) to classify the stability of $\ddt \statevec = \statemat \statevec$ for $\statemat = \begin{pmatrix} -1 & 4 \\ -1 & -1 \end{pmatrix}$ .

Solution

Characteristic polynomial: $\det(\lambda I - \statemat) = (\lambda+1)^2 + 4 = \lambda^2 + 2\lambda + 5$ . Roots: $\lambda = -1 \pm 2i$ . Both have $\Re(\lambda) = -1 < 0$ , so the system is asymptotically stable. The non-zero imaginary part means trajectories spiral toward the origin (stable spiral).

Exercise 2.2 (computation)

For the matrix $\statemat = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$ of Exercise 1.1, find the eigenvalues and classify the Lyapunov stability. Is the system asymptotically stable?

Solution

Eigenvalues: $\lambda = \pm i$ . Both have $\Re(\lambda) = 0$ , so the system is Lyapunov-stable but not asymptotically stable — trajectories are bounded (centers / closed orbits) but do not decay to zero. This is the undamped harmonic oscillator’s marginal case.

Exercise 2.3 (computation)

The forward Euler method has stability function $R(z) = 1 + z$ . For the continuous test problem $\dot x = \lambda x$ with $\lambda = -10$ (a stiff system with fast time scale), find the maximum step size $\stepsize$ for which forward Euler is stable.

Solution

Stability condition: $\abs{1 + \stepsize \lambda} \le 1$ with $\lambda = -10$ . Algebra: $\abs{1 - 10\stepsize} \le 1$ ⇔ $0 \le 10\stepsize \le 2$ ⇔ $\stepsize \le 0.2$ . So forward Euler requires $\stepsize \le 0.2$ . Any $\stepsize > 0.2$ produces exponentially growing iterates even though the continuous system decays. This is the classic motivation for implicit methods on stiff problems.

Exercise 2.4 (computation)

Show that the bilinear method $R_{\text{bil}}(z) = (1 + z/2)/(1 - z/2)$ maps the closed left half-plane $\{\Re(z) \le 0\}$ onto the closed unit disk $\{\abs{w} \le 1\}$ . (This is the Möbius map picture of the bilinear transform.)

Solution

Let $z = a + bi$ with $a \le 0$ . Compute:

\abs{R_{\text{bil}}(z)}^2 = \frac{(1+a/2)^2 + (b/2)^2}{(1-a/2)^2 + (b/2)^2}.

The denominator minus the numerator equals $(1-a/2)^2 - (1+a/2)^2 = -2a \ge 0$ (since $a \le 0$ ). So numerator $\le$ denominator and $\abs{R_{\text{bil}}(z)} \le 1$ . Equality iff $a = 0$ (purely imaginary $z$ ), so the imaginary axis maps to the unit circle and the open LHP maps to the open unit disk. ∎

Exercise 2.5 (theory) — solution in §2.9

State and prove the Lyapunov equation characterization of asymptotic stability: $\statemat$ is asymptotically stable iff for every symmetric positive-definite $Q$ , the equation $\statemat^\top P + P \statemat = -Q$ has a unique symmetric positive-definite solution $P$ .

Exercise 2.6 (theory) — solution in §2.9

Prove the BIBO criterion of Theorem 2.4: a causal LTI system is BIBO-stable iff all poles of its transfer function lie in the open left half-plane. (Hint: Use the Laplace-transform / convolution duality, and exploit the representation of the impulse response as a sum of exponentials.)

2.9 Full solutions to theory exercises

Solution to Exercise 2.5

Forward direction (Lyapunov equation ⇒ asymptotic stability): Suppose $P$ is symmetric positive-definite and satisfies $\statemat^\top P + P \statemat = -Q$ for some symmetric positive-definite $Q$ . Define $V(\statevec) = \statevec^\top P \statevec$ . Then $V > 0$ for $\statevec \ne 0$ (since $P \succ 0$ ), and along trajectories $\ddt \statevec = \statemat \statevec$ :

\dot V = \dot \statevec^\top P \statevec + \statevec^\top P \dot \statevec = \statevec^\top \statemat^\top P \statevec + \statevec^\top P \statemat \statevec = \statevec^\top (-Q) \statevec < 0.

So $V$ is a strict Lyapunov function and trajectories satisfy $V(\statevec(t)) \le V(\statevec(0)) e^{-\alpha t}$ for $\alpha > 0$ determined by the smallest eigenvalue of $Q$ over the largest of $P$ . Hence $\statevec(t) \to 0$ and the origin is asymptotically stable.

Reverse direction (asymptotic stability ⇒ Lyapunov equation has p.d. solution): Suppose $\statemat$ is asymptotically stable. Define

P := \int_0^\infty e^{\statemat^\top t} Q \, e^{\statemat t} \, dt.

The integral converges absolutely because $e^{\statemat t}$ decays exponentially. $P$ is symmetric (transpose preserves the integrand structure) and positive-definite (since $Q \succ 0$ and $e^{\statemat t}$ is invertible). Direct computation:

\statemat^\top P + P \statemat = \int_0^\infty \left( \statemat^\top e^{\statemat^\top t} Q e^{\statemat t} + e^{\statemat^\top t} Q e^{\statemat t} \statemat \right) dt = \int_0^\infty \ddt \left( e^{\statemat^\top t} Q e^{\statemat t} \right) dt = [e^{\statemat^\top t} Q e^{\statemat t}]_0^\infty = -Q,

using the decay of $e^{\statemat t}$ at $\infty$ . Uniqueness of $P$ follows from the linearity of the Lyapunov operator $\mathcal{L}(P) := \statemat^\top P + P \statemat$ : its eigenvalues are the pairwise sums $\lambda_i + \lambda_j$ of eigenvalues of $\statemat$ , so its kernel is trivial whenever no two eigenvalues sum to zero — which holds here because asymptotic stability gives $\Re(\lambda_i + \lambda_j) < 0$ for all $i, j$ . ∎

Solution to Exercise 2.6

(⇒) BIBO-stable implies poles in open LHP. Suppose the system is BIBO-stable, i.e. $h \in L^1$ . The Laplace transform $H(s) = \int_0^\infty h(t) e^{-st} dt$ then exists and is analytic for $\Re(s) \ge 0$ (since the integrand is bounded by $\abs{h(t)}$ which is integrable, the integral converges uniformly). $H$ is rational (the matrix-inverse formula gives $H(s) = \outputmat (sI - \statemat)^{-1} \inputmat$ , a ratio of polynomials with denominator $\det(sI - \statemat)$ ). A rational function analytic on the closed right half-plane has no poles there — so all poles of $H(s)$ lie in the open LHP.

(⇐) Poles in open LHP implies BIBO-stable. Suppose all poles of $H(s)$ lie in the open LHP. Partial-fraction decomposition gives $H(s) = \sum_k \frac{c_k}{(s - p_k)^{m_k}}$ where $p_k$ are the poles with multiplicity $m_k$ , and $\Re(p_k) < 0$ for all $k$ . The inverse Laplace transform of each term is $\frac{c_k t^{m_k - 1}}{(m_k - 1)!} e^{p_k t}$ , which is $L^1$ on $[0, \infty)$ because the exponential decay beats any polynomial growth in $t$ . Sum of $L^1$ functions is $L^1$ , so $h \in L^1$ and BIBO holds.

(For the multi-input multi-output case, apply the same argument entrywise to $H(s) \in \C^{M \times D}$ .) ∎

2.10 Companion code

Two JAX companions and one PyTorch companion for Chapter 2.

JAX (companions/ch02/jax/):

lyapunov_qr.py — implements the QR-based Lyapunov-exponent algorithm and applies it to the ring-of-oscillators system from Chapter 1
stability_regions.py — plots the regions of absolute stability for forward Euler, backward Euler, bilinear (trapezoidal), and ZOH in the complex plane

PyTorch (companions/ch02/torch/):

lyapunov_qr.py — the QR-based Lyapunov spectrum in idiomatic PyTorch (the jax.lax.scan-vs-eager-loop contrast with the JAX companion; compute-and-parity only, the JAX companion produces the figure).
tests/ — cross-framework parity: the torch Lyapunov spectra match their JAX counterparts.

To run from the repo root:

PYTHONPATH=. python companions/ch02/jax/lyapunov_qr.py
PYTHONPATH=. python companions/ch02/jax/stability_regions.py
PYTHONPATH=. python companions/ch02/torch/lyapunov_qr.py

Figures land in public/figures/ch02/ (referenced from §2.3 and §2.4 above).

Part I · Foundations Week 3 Published

Linear algebra for sequence models: structured matrices and conditioning

Eigenvalue and SVD decompositions, condition number, the four structured-matrix families (Toeplitz, Vandermonde, Cauchy, semiseparable), low-rank updates, and a Krylov-subspace primer — the linear-algebra vocabulary later chapters assume.

Linear algebra for sequence models: structured matrices and conditioning

Chapter 3 — at a glance

Goal: develop the structured-matrix vocabulary that the SSM kernel constructions of Chapters 7–9 assume — eigenvalue and singular-value decompositions, condition number, Toeplitz / Vandermonde / Cauchy / semiseparable structures, low-rank corrections, and a Krylov-subspace primer. This is the chapter that lets you read S4’s “diagonal-plus-low-rank” parametrization or Mamba-2’s “1-semiseparable mask” without re-deriving the underlying linear algebra each time.

Reading time: ~30 minutes prose; 60 minutes with the structured-matrix companion.

Key insight: every fast SSM algorithm depends on $\statemat$ having one of four structured forms. The forms aren’t arbitrary — they’re the structures for which matrix-vector products take $O(N \log N)$ or $O(N)$ time instead of $O(N^2)$ . Knowing which structure each architecture exploits tells you immediately what its computational sweet spot is.

3.1 Eigenvalue decomposition and Jordan normal form

Chapter 1’s matrix exponential built directly on the spectral structure of $\statemat$ . The fundamental theorem is:

Theorem 3.1.

For every matrix $\statemat \in \C^{N \times N}$ , there exist matrices $V$ (invertible) and $J$ (block-diagonal with Jordan blocks on the diagonal) such that

\statemat = V J V^{-1}.

$J$ is the Jordan normal form of $\statemat$ ; the diagonal entries of $J$ are the eigenvalues of $\statemat$ (counted with algebraic multiplicity); for each eigenvalue, the number of Jordan blocks equals its geometric multiplicity and the sum of their sizes equals its algebraic multiplicity. If every eigenvalue has equal algebraic and geometric multiplicity, all Jordan blocks have size 1 and $J$ is diagonal — $\statemat$ is diagonalizable.

A Jordan block of size $k$ for eigenvalue $\lambda$ looks like

J_k(\lambda) = \begin{pmatrix} \lambda & 1 & & \\ & \lambda & 1 & \\ & & \ddots & \ddots \\ & & & \lambda & 1 \\ & & & & \lambda \end{pmatrix} \in \C^{k \times k},

i.e. eigenvalue on the diagonal, ones on the superdiagonal, zeros elsewhere. The matrix exponential of a Jordan block is

e^{J_k(\lambda) t} = e^{\lambda t} \begin{pmatrix} 1 & t & t^2/2! & \cdots & t^{k-1}/(k-1)! \\ & 1 & t & \cdots & t^{k-2}/(k-2)! \\ & & \ddots & \ddots & \vdots \\ & & & 1 & t \\ & & & & 1 \end{pmatrix},

which is the source of the polynomial-in- $t$ factors mentioned in Chapter 1, §1.2 — they appear exactly when there are Jordan blocks of size $> 1$ .

For SSM analysis, the diagonalizable case dominates. Almost every $\statemat$ arising from random initialization or from training is diagonalizable; the non-diagonalizable case is a measure-zero, codimension-one subset of parameter space and shows up only at carefully-tuned boundaries. But structured or learned $\statemat$ — HiPPO-LegS, S4’s DPLR parametrization, the critically-damped regime of §1.3 — can carry Jordan structure by design, and Chapter 2’s Lyapunov theorem explicitly handles the defective case.

3.2 Singular value decomposition

The eigenvalue decomposition requires $\statemat$ to be square; for general (rectangular or possibly non-diagonalizable) matrices, the right tool is the singular value decomposition.

Theorem 3.2.

For every matrix $\statemat \in \R^{m \times n}$ , there exist orthogonal matrices $U \in \R^{m \times m}$ , $V \in \R^{n \times n}$ , and a diagonal-with-zeros matrix $\Sigma \in \R^{m \times n}$ with non-negative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)} \ge 0$ on the diagonal, such that

\statemat = U \Sigma V^\top.

The $\sigma_i$ are the singular values of $\statemat$ ; they are uniquely determined. The number of nonzero $\sigma_i$ equals $\rank(\statemat)$ .

The SVD has three properties that make it the workhorse of numerical linear algebra:

It always exists. Unlike the eigenvalue decomposition, no diagonalizability or even squareness assumption is needed.
The singular values give the Frobenius and operator norms. $\norm{\statemat}_2 = \sigma_1$ (operator norm, equal to the largest singular value); $\norm{\statemat}_F = \sqrt{\sum_i \sigma_i^2}$ .
It reveals low-rank structure cleanly. Truncating the SVD at rank $r < \rank(\statemat)$ — keeping only the top $r$ singular values and corresponding columns of $U, V$ — gives the best rank- $r$ approximation to $\statemat$ in both Frobenius and operator norms (the Eckart–Young theorem).

For square $\statemat$ , the singular values are related to but distinct from the eigenvalues. The relationship $\sigma_i(\statemat)^2 = \lambda_i(\statemat^\top \statemat)$ (using descending order on both sides) lets you compute singular values via an eigenvalue problem on the Gram matrix $\statemat^\top \statemat$ — though in practice the QR-iteration-based SVD algorithm (LAPACK’s gesdd) is more numerically stable.

The SVD shows up explicitly in:

Chapter 2’s Lyapunov-exponent computation (singular values of the propagated frame matrix).
Chapter 8’s HiPPO matrix analysis (the conditioning of the projection operator is given by its SVD).
Chapter 12’s delta-rule lineage (DeltaNet’s state update is a rank-1 SVD correction).

For a textbook treatment of the SVD’s properties and algorithms, Trefethen–Bau Trefethen & Bau (1997) Chapters 4–5 are the standard reference. Golub–Van Loan Golub & Van Loan (2013) covers the algorithmic details exhaustively.

3.3 Condition number

The condition number of a matrix $\statemat \in \R^{N \times N}$ (with respect to the operator norm) is

\kappa(\statemat) := \norm{\statemat}_2 \cdot \norm{\statemat^{-1}}_2 = \frac{\sigma_1(\statemat)}{\sigma_N(\statemat)},

the ratio of the largest to smallest singular value (when $\statemat$ is invertible; otherwise $\kappa = \infty$ ). The condition number measures how much $\statemat$ amplifies relative input perturbations into relative output perturbations: if $\statemat \statevec = b$ and $b$ has relative error $\delta b / \norm{b}$ , then the relative error in the computed solution can be as large as $\kappa(\statemat) \cdot \delta b / \norm{b}$ .

The qualitative scale: $\kappa = 1$ is the orthogonal case (no amplification); $\kappa \approx 10^k$ means losing roughly $k$ decimal digits of precision when solving a linear system. Double-precision floating point has about 16 decimal digits; matrices with $\kappa \gtrsim 10^{16}$ are numerically singular.

For SSMs, condition number matters in three places:

HiPPO matrix construction. The HiPPO-LegS matrix’s condition number grows polynomially with $N$ (empirically $\kappa \sim N^2$ ) — far below the exponential blow-up of a generic ill-conditioned family like the Hilbert matrix, though above a random Gaussian’s $\kappa \sim N$ . HiPPO-LegS is the standard SSM initialization for its optimal-polynomial-projection memory Gu et al. (2020) , not for being the best-conditioned matrix; its merely-polynomial growth is what keeps large- $N$ initialization numerically tractable. A subtler point: HiPPO-LegS is highly non-normal, so its eigenvector matrix is exponentially ill-conditioned in $N$ and naive diagonalization is numerically fragile Yu et al. (2023) .
S4 kernel computation. S4’s Vandermonde-Cauchy kernel construction requires inverting a matrix with structured but potentially ill-conditioned columns. The paper carefully handles the conditioning; naive implementations don’t.
Mamba-3’s complex-state design. Chapter 10 discusses how Mamba-3 deliberately places eigenvalues in a region of the complex plane where the discrete-time map remains well-conditioned across the integration step Lahoti et al. (2026) .

Growth of the condition number of various structured matrices as size N increases. — Condition number κ(A) as a function of matrix size N for: a random Gaussian matrix (κ grows roughly linearly in N, modulo log factors); a Hilbert matrix (κ grows exponentially — the textbook example of ill-conditioning); the HiPPO-LegS matrix (κ grows polynomially, roughly as N², far slower than the Hilbert matrix's exponential growth). Produced by companions/ch03/jax/condition_number.py.

3.4 Structured matrix families

Four classes of structured matrices appear throughout the SSM literature. Each is parameterized by $O(N)$ numbers rather than the $O(N^2)$ a general matrix needs, and each admits fast matrix-vector products via specialized algorithms.

Toeplitz matrices

A Toeplitz matrix is constant along each diagonal:

T = \begin{pmatrix} t_0 & t_{-1} & t_{-2} & \cdots & t_{-(n-1)} \\ t_1 & t_0 & t_{-1} & & \vdots \\ t_2 & t_1 & t_0 & \ddots & t_{-2} \\ \vdots & & \ddots & \ddots & t_{-1} \\ t_{n-1} & \cdots & t_2 & t_1 & t_0 \end{pmatrix}.

Parameterized by $2n - 1$ values $(t_{-(n-1)}, \ldots, t_{n-1})$ . Toeplitz matrices are the matrix form of convolutions: if $T$ is Toeplitz with first column $(t_0, t_1, \ldots, t_{n-1})$ and first row $(t_0, 0, \ldots, 0)$ (lower-triangular Toeplitz), then $T \statevec$ is the discrete convolution of the kernel $(t_0, t_1, \ldots)$ with $\statevec$ . The FFT-based convolution algorithm computes $T \statevec$ in $O(n \log n)$ time.

The LTI SSM convolutional view (Chapter 8) is exactly this: the operator that maps the input sequence $u$ to the output sequence $y$ via $y(t) = \int_0^t h(t-s) u(s) ds$ is a Toeplitz matrix at the discretization level, with first column equal to the discretized impulse response $h(0), h(\stepsize), h(2\stepsize), \ldots$ .

Vandermonde matrices

A Vandermonde matrix has the form

V = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\ 1 & x_2 & x_2^2 & \cdots & x_2^{n-1} \\ \vdots & & & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^{n-1} \end{pmatrix},

parameterized by $n$ values $(x_1, \ldots, x_n)$ . The defining property is that $(V \statevec)_i = \sum_{j=0}^{n-1} v_j x_i^j$ is a polynomial in $x_i$ evaluated at the node $x_i$ . So Vandermonde matrices implement polynomial evaluation at $n$ points as a linear map.

Vandermonde matrices are notoriously ill-conditioned for nodes on the real line (condition number can grow exponentially), but well-behaved for nodes on the unit circle — which is why the FFT (Vandermonde with $x_i$ being roots of unity) is numerically stable.

The S4 kernel computation uses Vandermonde structure: evaluating $\sum_j c_j \lambda_j^k$ over $k = 0, 1, \ldots, L-1$ is exactly the Vandermonde-style polynomial evaluation. The S4 paper Gu et al. (2022) uses Cauchy-matrix tricks (next subsection) to make this evaluation stable.

Cauchy matrices

A Cauchy matrix has entries

C_{ij} = \frac{1}{x_i - y_j},

parameterized by $n + m$ values $(x_1, \ldots, x_n, y_1, \ldots, y_m)$ with the $\{x_i\} \cap \{y_j\}$ empty (to avoid zero denominators). Cauchy matrices are dense — every entry depends on both indices — but the very structured dependence enables fast algorithms: a Cauchy matrix-vector product can be computed in $O(n \log^2 n)$ time using the fast multipole method.

Cauchy matrices appear in two places in the SSM literature: the S4 paper uses them as a numerically stable replacement for direct Vandermonde-style kernel evaluation, and the diagonal-plus-low-rank parametrization of $\statemat$ in S4 has a Cauchy-matrix interpretation when viewed through the partial-fraction decomposition of its transfer function.

Semiseparable matrices

A rank- $k$ semiseparable matrix has the property that every submatrix lying strictly above the main diagonal (and every submatrix lying strictly below) has rank at most $k$ . Equivalently, the upper and lower triangular parts each have rank- $k$ structure.

The 1-semiseparable case — every off-diagonal block has rank at most 1 — is the structure exploited by Mamba-2’s SSD framework Dao & Gu (2024) . The 1-semiseparable lower-triangular matrix corresponding to a scalar-times-identity SSM is, explicitly,

M = \begin{pmatrix} 1 & & & \\ a_1 & 1 & & \\ a_1 a_2 & a_2 & 1 & \\ a_1 a_2 a_3 & a_2 a_3 & a_3 & 1 \\ \vdots & & & \ddots \end{pmatrix},

where $a_i$ is the (scalar) recurrence coefficient at step $i$ . The $(i, j)$ entry for $i > j$ is the product $a_{j+1} a_{j+2} \cdots a_i$ . The SSD insight is that this matrix-vector product can be computed two ways: as a scan (the SSM view, $O(L)$ time) or as a structured matrix multiply (the attention view, $O(L^2)$ time but matmul-friendly on GPUs).

When $L$ is moderate (say $\le 8192$ ) and the GPU favors matmul over scan, the matrix view wins; when $L$ is huge, the scan view wins. Mamba-2’s chunked algorithm interpolates: matrix-multiplies within chunks of size $\sim 256$ , scans across chunks. This is the structural reason Mamba-2 is faster than Mamba-1 on long sequences without sacrificing parallelism.

Heatmap visualizations of Toeplitz, Vandermonde, Cauchy, and semiseparable matrices showing their structural patterns. — Structure of the four matrix families for N=8: Toeplitz (constant along diagonals — convolution); Vandermonde (powers of node values in rows); Cauchy (each entry is 1/(x_i - y_j)); 1-semiseparable lower triangular (each lower-triangular entry is the product of a prefix of off-diagonal factors). Color encodes log|entry|. Produced by companions/ch03/jax/structured_matrices.py.

3.5 Low-rank corrections and rank-1 updates

A recurring pattern: take a structured matrix $S$ (diagonal, banded, or semiseparable) and add a small low-rank correction $U V^\top$ where $U, V \in \R^{n \times k}$ with $k \ll n$ . The resulting matrix $S + U V^\top$ retains nearly the storage and matvec efficiency of $S$ , and admits a closed-form inverse via the Sherman–Morrison–Woodbury identity:

(S + U V^\top)^{-1} = S^{-1} - S^{-1} U (I + V^\top S^{-1} U)^{-1} V^\top S^{-1}.

The cost is dominated by inverting the $k \times k$ matrix $I + V^\top S^{-1} U$ , not the $n \times n$ outer matrix.

This pattern appears in S4 explicitly: the HiPPO-LegS matrix isn’t diagonal, but it is “diagonal plus low-rank” — specifically, normal plus low-rank, which is what makes the kernel computation tractable. The S4 paper’s main algorithmic contribution is a specialized Sherman–Morrison-style trick for this exact decomposition.

The pattern recurs even more visibly in DeltaNet (Chapter 12), where the state update is

S_{t+1} = S_t + \beta_t (v_t - S_t k_t) k_t^\top = S_t (I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top.

The middle factor is a rank-1 correction (an identity minus a rank-1 outer product), and the whole expression is one explicit (forward-Euler) step of an online gradient-descent update on the per-token association loss — the DeltaNet view Yang et al. (2024) developed in Chapter 12. Longhorn Liu et al. (2024) is the implicit (backward-Euler) cousin, reaching the same online-ODE picture by solving at the endpoint.

The key takeaway: structured + low-rank corrections give you efficient matrix-vector products without giving up expressiveness. This is the design pattern unifying S4, DeltaNet, GLA, and the Mamba-2 SSD framework — each is a different choice of base structure and correction pattern.

3.6 Krylov subspaces: a primer

The Krylov subspace of order $k$ generated by a matrix $\statemat$ and a vector $b$ is

\mathcal{K}_k(\statemat, b) := \text{span}\{ b, \statemat b, \statemat^2 b, \ldots, \statemat^{k-1} b \}.

Krylov subspaces are the workhorse of iterative methods for large sparse linear systems: GMRES, conjugate gradient, Arnoldi iteration, Lanczos. All of them construct an orthonormal basis for $\mathcal{K}_k(\statemat, b)$ and solve a small ( $k \times k$ ) projected problem instead of the original $n \times n$ one.

For SSMs, the Krylov picture is conceptual rather than algorithmic. The point is that $\mathcal{K}_k(\statemat, b)$ contains all the information the recurrence $\statevec_{t+1} = \statemat \statevec_t + \inputmat u_t$ can extract from the initial condition $b$ in $k$ steps. If the system has $n - k$ eigenvalues that are decoupled from the initial direction $b$ (the so-called “unreachable subspace”), the recurrence cannot recover them, and the effective dimension of the SSM is $k < n$ . This is one rigorous reading of the Mamba copying-limitation results — the recurrence’s expressive ceiling is set by the Krylov dimension of $\statemat$ relative to the input.

A full treatment of Krylov methods would fill its own chapter; the curriculum revisits the picture in Chapter 8 (where the S4 kernel can be viewed as a structured Krylov-projection problem) and in Chapter 16’s empirical-methodology discussion of why some architectures can copy strings exponentially longer than others.

For a textbook coverage of Krylov-subspace methods, Trefethen–Bau Trefethen & Bau (1997) Chapters 32–40 give the full algorithmic treatment.

3.7 What’s next

You now have the linear-algebra vocabulary the rest of the book assumes. Chapter 4 picks up the discretization thread (started in Chapter 2, §2.4) and develops it systematically: order conditions, accuracy classes, the Butcher tableau, the bilinear and ZOH derivations in detail. Chapter 7 introduces the HiPPO theory that connects orthogonal-basis approximation theory to the $\statemat$ matrix structure of S4. Chapter 8 then shows how all four structured-matrix families combine in the S4 / S4D / S5 family.

If you’re impatient, Chapter 9’s Mamba-1/2 presentation is the payoff for the SSD discussion of §3.4 — the selective scan’s matmul-friendly chunkwise algorithm is exactly the 1-semiseparable matrix product algorithm.

3.8 Exercises

Six problems. Inline solutions for the shorter ones; full proofs for the theory exercises in §3.9.

Exercise 3.1 (computation)

Compute the Jordan normal form of $\statemat = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}$ .

Solution

The matrix is already in Jordan form: a single $2 \times 2$ Jordan block with eigenvalue $\lambda = 2$ . There is one eigenvalue (algebraic multiplicity 2) but only one linearly independent eigenvector (geometric multiplicity 1), giving a defective $J_2(2)$ . The transformation matrix $V$ is the identity ( $J = \statemat$ directly).

Exercise 3.2 (computation)

Compute the singular values of $\statemat = \begin{pmatrix} 3 & 0 \\ 0 & 4 \\ 0 & 0 \end{pmatrix}$ and its operator norm.

Solution

The matrix is already in SVD form (with $U = I_3$ and $V = I_2$ ). Singular values are $\sigma_1 = 4$ , $\sigma_2 = 3$ . Operator norm: $\norm{\statemat}_2 = \sigma_1 = 4$ . (Note that swapping the diagonal entries doesn’t change the SVD; singular values are always sorted descending.)

Exercise 3.3 (computation)

For the Vandermonde matrix with nodes $(x_1, x_2, x_3) = (1, 2, 3)$ , write out the matrix and compute its determinant. Compare to the closed-form Vandermonde determinant $\det V = \prod_{i < j} (x_j - x_i)$ .

Solution

$V = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{pmatrix}.$

Direct computation: $\det V = 1 \cdot (2 \cdot 9 - 4 \cdot 3) - 1 \cdot (1 \cdot 9 - 4 \cdot 1) + 1 \cdot (1 \cdot 3 - 2 \cdot 1) = 6 - 5 + 1 = 2$ .

Closed form: $(2-1)(3-1)(3-2) = 1 \cdot 2 \cdot 1 = 2$ . ✓

Exercise 3.4 (computation)

Verify the Sherman–Morrison identity numerically: pick a random invertible $3 \times 3$ matrix $S$ , vectors $u, v \in \R^3$ , and check that $(S + u v^\top)^{-1}$ matches the closed-form expression to machine precision.

Solution

import numpy as np
rng = np.random.default_rng(0)
S = rng.standard_normal((3, 3))
u = rng.standard_normal(3)
v = rng.standard_normal(3)
direct = np.linalg.inv(S + np.outer(u, v))
S_inv = np.linalg.inv(S)
factor = 1.0 + v @ S_inv @ u
sherman = S_inv - np.outer(S_inv @ u, v @ S_inv) / factor
print(np.allclose(direct, sherman))  # True

The factor $1 + v^\top S^{-1} u$ must be non-zero (the Sherman–Morrison identity fails when it is, indicating that $S + u v^\top$ is singular).

Exercise 3.5 (theory) — solution in §3.9

Prove the Eckart–Young theorem: the best rank- $k$ approximation to a matrix $\statemat$ in the Frobenius norm is obtained by truncating its SVD at rank $k$ . That is, if $\statemat = U \Sigma V^\top$ with singular values $\sigma_1 \ge \cdots \ge \sigma_n$ , then $\statemat_k := \sum_{i \le k} \sigma_i u_i v_i^\top$ minimizes $\norm{\statemat - B}_F$ over all matrices $B$ of rank $\le k$ .

Exercise 3.6 (theory) — solution in §3.9

Prove that any Toeplitz matrix $T \in \R^{n \times n}$ admits a matrix-vector product in $O(n \log n)$ time via the FFT. Specifically, show that any $n \times n$ Toeplitz $T$ can be embedded in a $2n \times 2n$ circulant matrix, whose matvec is exactly the FFT-IFFT pair applied to the kernel and input.

3.9 Full solutions to theory exercises

Solution to Exercise 3.5

The proof uses two ingredients: (a) the SVD’s unitary invariance of the Frobenius norm, and (b) the optimality of truncated diagonal matrices.

Setup. Let $\statemat = U \Sigma V^\top$ be the SVD with $\Sigma = \diag(\sigma_1, \ldots, \sigma_n)$ and $\sigma_1 \ge \cdots \ge \sigma_n \ge 0$ . For any $B$ of rank $\le k$ , write $B = U \widetilde B V^\top$ where $\widetilde B := U^\top B V$ also has rank $\le k$ (rank is invariant under invertible transformations). The Frobenius norm is invariant under orthogonal transformations:

\norm{\statemat - B}_F = \norm{U(\Sigma - \widetilde B) V^\top}_F = \norm{\Sigma - \widetilde B}_F.

So the problem reduces to: minimize $\norm{\Sigma - \widetilde B}_F$ over rank- $\le k$ matrices $\widetilde B$ .

Diagonal reduction. Write $\widetilde B$ with its own SVD $\widetilde B = X D Y^\top$ where $D = \diag(d_1, \ldots, d_k, 0, \ldots, 0)$ . Then

\norm{\Sigma - \widetilde B}_F^2 = \tr((\Sigma - \widetilde B)^\top (\Sigma - \widetilde B)) = \norm{\Sigma}_F^2 - 2 \tr(\Sigma \widetilde B^\top) + \norm{\widetilde B}_F^2.

The middle trace term $\tr(\Sigma \widetilde B^\top) \le \sum_{i \le k} \sigma_i d_i$ by the von-Neumann trace inequality (with equality when $\widetilde B$ ‘s singular vectors align with $\Sigma$ ‘s — i.e. $\widetilde B$ is diagonal in the same basis as $\Sigma$ ). Optimizing $d_i$ to maximize this term subject to rank $\le k$ gives $d_i = \sigma_i$ for $i \le k$ and $d_i = 0$ for $i > k$ , i.e. $\widetilde B = \diag(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)$ .

Substituting back, the minimum value is $\norm{\Sigma - \widetilde B^*}_F^2 = \sum_{i > k} \sigma_i^2$ , achieved by the truncated SVD $\statemat_k = U \diag(\sigma_1, \ldots, \sigma_k, 0, \ldots) V^\top = \sum_{i \le k} \sigma_i u_i v_i^\top$ . ∎

(The proof for the operator norm follows the same structure but uses Weyl’s interlacing inequality instead of von-Neumann’s trace inequality; see Golub–Van Loan Golub & Van Loan (2013) for the detailed argument.)

Solution to Exercise 3.6

Let $T \in \R^{n \times n}$ be Toeplitz with entries $T_{ij} = t_{i-j}$ for some kernel $(t_{-(n-1)}, \ldots, t_{n-1})$ .

Step 1 — Circulant embedding. Define a circulant matrix $C \in \R^{2n \times 2n}$ with first column

c = (t_0, t_1, t_2, \ldots, t_{n-1}, 0, t_{-(n-1)}, t_{-(n-2)}, \ldots, t_{-1})^\top.

The first $n$ rows and first $n$ columns of $C$ exactly reproduce $T$ (the zero in the middle of $c$ is the “buffer” that prevents wrap-around contamination).

Step 2 — Padded matvec. Given $\statevec \in \R^n$ , form $\widetilde \statevec = (\statevec, 0)^\top \in \R^{2n}$ (zero-padded). Then $(C \widetilde \statevec)_{1:n} = T \statevec$ : the first $n$ entries of the circulant product are exactly the Toeplitz product, because the zero-padding ensures the wrap-around portion of the convolution doesn’t contaminate the top half.

Step 3 — Circulant matvec via FFT. Every $2n \times 2n$ circulant matrix $C$ is diagonalized by the discrete Fourier transform: $C = F^{-1} \diag(F c) F$ , where $F$ is the $2n$ -point DFT matrix. So

C \widetilde \statevec = F^{-1} \, (\diag(F c) \cdot F \widetilde \statevec) = F^{-1} \, ((F c) \odot (F \widetilde \statevec)),

where $\odot$ is element-wise product. Both FFTs and the IFFT take $O(n \log n)$ time; the element-wise product is $O(n)$ .

Total cost. $O(n \log n)$ for the FFTs + $O(n)$ for the element-wise multiply + $O(n)$ to extract the first $n$ entries = $O(n \log n)$ . ∎

This is the exact algorithm used to compute LTI SSM convolutions in S4-family implementations: precompute the kernel $h(0), h(\stepsize), \ldots, h((L-1)\stepsize)$ once, then convolve with any input in $O(L \log L)$ time.

3.10 Companion code

Two JAX companions and one PyTorch companion for Chapter 3.

JAX (companions/ch03/jax/):

condition_number.py — plots κ(A) growth for random Gaussian, Hilbert, and HiPPO-LegS matrices as N grows; produces Figure 3.1
structured_matrices.py — constructs Toeplitz / Vandermonde / Cauchy / 1-semiseparable matrices for N=8 and visualizes structural patterns; produces Figure 3.2

PyTorch (companions/ch03/torch/):

condition_number.py — the same κ(A) conditioning sweep (HiPPO-LegS / Hilbert / Gaussian) in idiomatic PyTorch (compute-and-parity only; the JAX companion produces Figure 3.1).
tests/ — cross-framework parity: the torch condition numbers match their JAX counterparts.

To run from the repo root:

PYTHONPATH=. python companions/ch03/jax/condition_number.py
PYTHONPATH=. python companions/ch03/jax/structured_matrices.py
PYTHONPATH=. python companions/ch03/torch/condition_number.py

Figures land in public/figures/ch03/.

Part I · Foundations Week 4 Published

Discretization theory: ZOH, bilinear, exponential families

One-step discretizations of linear ODEs — zero-order hold, bilinear (Tustin), and exponential-trapezoidal — with the Lax equivalence theorem as the unifying convergence criterion.

Discretization theory: ZOH, bilinear, exponential families

Chapter 4 — at a glance

Goal: take the continuous linear state-space system of Chapter 1, $\ddt \statevec(t) = \statemat \statevec(t) + \inputmat u(t)$ , and turn it into a discrete recurrence $\statevec_{k+1} = \discA \statevec_k + \discB u_k$ that a computer can step through one time index at a time. Three schemes do the heavy lifting in this book: zero-order hold (ZOH, the S5/Mamba default and this book’s workhorse), bilinear (Tustin, the second-order signal-processing classic — and the original S4 paper’s choice), and exponential-trapezoidal (Mamba-3’s pick for stiff dynamics). By the end you can derive all three from first principles, prove each preserves stability for left-half-plane $\statemat$ , and read the empirical order of accuracy off a log-log slope.

Reading time: ~40 minutes prose; 60–90 minutes with the JAX and Julia companions.

Key insight: discretization is not a single procedure — it is a family of one-step methods parameterized by what each scheme assumes about the input between samples. ZOH assumes the input is constant on $[t_k, t_{k+1})$ . Bilinear and exp-trapezoidal interpolate linearly. The interpolation assumption determines order of accuracy; the exponential factor $e^{\statemat \stepsize}$ that appears in ZOH and exp-trap is what makes those schemes A-stable. We will see in Chapter 5 that this exponential is also why their stability regions are unbounded.

4.1 The discretization problem

Chapter 1 wrote the linear state-space system as a continuous-time ODE

\ddt \statevec(t) = \statemat \statevec(t) + \inputmat u(t), \qquad y(t) = \outputmat \statevec(t),

defined for every $t \ge 0$ . Computers do not handle “every $t$ ”; they handle a finite grid $t_k = k \stepsize$ for $k = 0, 1, 2, \ldots$ with step size $\stepsize > 0$ . A discretization is a rule for building a recurrence

\statevec_{k+1} = \discA \statevec_k + \discB u_k

whose iterates $\statevec_k$ approximate the continuous trajectory at the grid points, $\statevec_k \approx \statevec(t_k)$ . The two discrete matrices $\discA, \discB$ are functions of the continuous matrices $\statemat, \inputmat$ and of the step size $\stepsize$ . Different functions give different discretizations.

There is no single canonical choice. Every discrete recurrence in this book — the S4 layer, the Mamba selective scan, the exp-trapezoidal scheme of Mamba-3 Lahoti et al. (2026) — picks a specific rule for translating $(\statemat, \inputmat, \stepsize) \mapsto (\discA, \discB)$ . The choice matters: a poor discretization can take a perfectly stable continuous system and produce a discrete recurrence that diverges, fails to converge as $\stepsize \to 0$ , or accumulates phase error on long horizons. Numerical-analysis textbooks Hairer et al. (1993) spend hundreds of pages on these failure modes. We will compress them into three properties and three schemes.

A useful warm-up is forward Euler, the simplest possible scheme:

\statevec_{k+1} = \statevec_k + \stepsize (\statemat \statevec_k + \inputmat u_k) = (I + \stepsize \statemat) \statevec_k + \stepsize \inputmat u_k.

This is what you get by replacing $\ddt \statevec$ with the forward difference $(\statevec_{k+1} - \statevec_k)/\stepsize$ and freezing the input at $u_k$ . The discrete matrices are $\discA = I + \stepsize \statemat$ and $\discB = \stepsize \inputmat$ . Forward Euler is the cleanest illustration of what can go wrong: it is only conditionally stable, meaning that for any fixed $\statemat$ with $\operatorname{Re}(\lambda_i) < 0$ there is a maximum step size beyond which $\discA = I + \stepsize \statemat$ acquires an eigenvalue of modulus $> 1$ and the recurrence blows up. The three schemes in §4.3–§4.5 fix this defect — each in its own way, with its own trade-off.

4.2 The Lax equivalence theorem

To compare schemes you need a definition of “correct.” The standard one comes from the Lax equivalence theorem Lax & Richtmyer (1956) , which decomposes correctness into two ingredients: consistency and stability.

A one-step scheme $\statevec_{k+1} = \Phi(\stepsize)(\statevec_k, u_k)$ is consistent with the ODE if its local truncation error — the residual you obtain by plugging the exact continuous trajectory into the discrete recurrence — vanishes faster than $\stepsize$ as $\stepsize \to 0$ . Formally,

\tau(\stepsize) := \frac{\statevec(t + \stepsize) - \Phi(\stepsize)(\statevec(t), u(t))}{\stepsize} \to 0 \quad \text{as } \stepsize \to 0.

If $\tau(\stepsize) = O(\stepsize^p)$ the scheme is $p$ -th order consistent. Higher $p$ means the per-step error shrinks faster with $\stepsize$ .

A scheme is zero-stable if small perturbations $\delta$ injected at one step propagate boundedly to all future steps. For linear schemes this reduces to a uniform bound on the powers of $\discA$ : there exists $K$ independent of $\stepsize$ such that $\norm{\discA^k} \le K$ for all $k$ with $k \stepsize \le T$ , for any fixed horizon $T$ . Equivalently, the spectral radius $\rho(\discA) \le 1 + O(\stepsize)$ .

The two ingredients combine.

Theorem 4.1.

(Lax equivalence.) For a linear, well-posed initial-value problem, a one-step scheme is convergent — meaning $\max_{k \stepsize \le T} \norm{\statevec_k - \statevec(t_k)} \to 0$ as $\stepsize \to 0$ — if and only if it is both consistent and zero-stable.

Convergence is what you actually want: that the discrete iterates approach the true continuous trajectory as you refine the grid. The theorem says you cannot get there by being consistent alone (you also need stability) or by being stable alone (you also need consistency). Both are necessary, and together they are sufficient. This is the convergence criterion every scheme in this chapter must pass.

For a stable LTI test problem — meaning $\statemat$ has all eigenvalues in the open left half-plane — there is a stronger notion sitting on top of zero-stability: a scheme is A-stable if $\rho(\discA) \le 1$ for every step size $\stepsize > 0$ , not just for sufficiently small $\stepsize$ . Forward Euler is not A-stable; ZOH, bilinear, and exp-trapezoidal are. The geometry of A-stability — the set of $\statemat \stepsize$ values for which the scheme behaves well — is the subject of Chapter 5.

4.3 Zero-order hold (ZOH)

The first scheme assumes the input is piecewise constant between samples: $u(t) = u_k$ for $t \in [t_k, t_{k+1})$ . Under that assumption the inhomogeneous ODE solves exactly on a single interval. Starting from $\statevec(t_k) = \statevec_k$ ,

\statevec(t_{k+1}) = e^{\statemat \stepsize} \statevec_k + \int_{t_k}^{t_{k+1}} e^{\statemat (t_{k+1} - s)} \inputmat \, u_k \, ds = e^{\statemat \stepsize} \statevec_k + \statemat^{-1}\!\left(e^{\statemat \stepsize} - I\right) \inputmat \, u_k.

Identifying the two matrices,

\discA = e^{\statemat \stepsize}, \qquad \discB = \statemat^{-1}\!\left(e^{\statemat \stepsize} - I\right) \inputmat.

This is the zero-order hold (ZOH) discretization. The hold refers to the input assumption; “zero-order” refers to the polynomial degree of the held signal (a constant is a zero-degree polynomial).

A practical wrinkle: writing $\discB$ as $\statemat^{-1}(e^{\statemat \stepsize} - I) \inputmat$ requires $\statemat$ to be invertible. The HiPPO matrix of Chapter 7 is invertible; some structured $\statemat$ in later chapters are not. The augmented matrix exponential trick Van Loan (1978) handles both cases uniformly. Build the block matrix

M = \begin{pmatrix} \statemat \stepsize & \inputmat \stepsize \\ 0 & 0 \end{pmatrix} \in \R^{(N+P) \times (N+P)},

where $P$ is the input dimension ( $D$ in Chapter 1’s notation). Then

\exp(M) = \begin{pmatrix} e^{\statemat \stepsize} & \statemat^{-1}(e^{\statemat \stepsize} - I) \inputmat \\ 0 & I \end{pmatrix},

and a single expm call extracts both $\discA$ (top-left block) and $\discB$ (top-right block) without ever inverting $\statemat$ . The block identity goes back to Van Loan Van Loan (1978) ; the JAX companion discretization_comparison.py uses it.

Proposition 4.2.

(ZOH preserves Lyapunov stability.) If every eigenvalue $\lambda_i$ of $\statemat$ has $\operatorname{Re}(\lambda_i) < 0$ and $\stepsize > 0$ , then every eigenvalue $\mu_i = e^{\lambda_i \stepsize}$ of $\discA = e^{\statemat \stepsize}$ satisfies $\abs{\mu_i} < 1$ . ZOH is therefore A-stable.

The proof is one line: $\abs{e^{\lambda_i \stepsize}} = e^{\operatorname{Re}(\lambda_i) \stepsize} < 1$ when $\operatorname{Re}(\lambda_i) < 0$ and $\stepsize > 0$ . Stability is preserved for every positive step size, which is what A-stability means.

Two further properties are worth highlighting. First, for the autonomous ODE ( $u \equiv 0$ ), ZOH is exact: the recurrence $\statevec_{k+1} = e^{\statemat \stepsize} \statevec_k$ matches $\statevec(t_{k+1}) = e^{\statemat \stepsize} \statevec(t_k)$ without any error at all. This is unusual — most discretizations have nonzero error even on autonomous systems. Second, for forced systems with $u \in C^1$ , ZOH is first-order accurate: the per-step error in the forcing integral is $O(\stepsize^2)$ (because the integrand $e^{\statemat (t_{k+1} - s)} \inputmat (u(s) - u_k)$ is $O(\stepsize)$ over an interval of length $\stepsize$ ), and after $T/\stepsize$ steps the cumulative error is $O(\stepsize)$ . The empirical convergence rate plotted in the companion log-log error figure confirms slope $\approx 1$ .

Continuous eigenvalues in the complex plane (left half-plane) map to discrete eigenvalues inside the unit disk after ZOH discretization. — ZOH preserves stability: continuous eigenvalues in the open left half-plane (left panel) map to discrete eigenvalues strictly inside the unit disk (right panel) via $\mu_i = e^{\lambda_i \Delta}$. The mapping is bijective and depends on the step size $\Delta$. Produced by companions/ch04/jax/discretization_comparison.py.

4.4 The bilinear (Tustin) transform

The second scheme drops the piecewise-constant assumption and instead approximates the integral on $[t_k, t_{k+1}]$ by the trapezoidal rule: $\int_{t_k}^{t_{k+1}} f(s) \, ds \approx \tfrac{\stepsize}{2}(f(t_k) + f(t_{k+1}))$ . Applied to the inhomogeneous ODE this gives an implicit recurrence,

\statevec_{k+1} = \statevec_k + \tfrac{\stepsize}{2}(\statemat \statevec_k + \inputmat u_k) + \tfrac{\stepsize}{2}(\statemat \statevec_{k+1} + \inputmat u_{k+1}),

which one solves for $\statevec_{k+1}$ :

\statevec_{k+1} = \left(I - \tfrac{\stepsize}{2}\statemat\right)^{-1}\left[\left(I + \tfrac{\stepsize}{2}\statemat\right) \statevec_k + \tfrac{\stepsize}{2} \inputmat (u_k + u_{k+1})\right].

Reading off $\discA$ and $\discB$ ,

\discA = \left(I - \tfrac{\stepsize}{2}\statemat\right)^{-1}\!\left(I + \tfrac{\stepsize}{2}\statemat\right), \qquad \discB = \left(I - \tfrac{\stepsize}{2}\statemat\right)^{-1} \stepsize \, \inputmat.

This is the bilinear transform, also called the Tustin transform after Arnold Tustin’s 1947 paper introducing it in control theory Tustin (1947) . The original S4 paper uses bilinear discretization (S4D supports either bilinear or ZOH); so do many signal-processing toolboxes.

The bilinear formula has a striking geometric interpretation. Restricted to a single eigenvalue $\lambda$ of $\statemat$ (commuting through the simultaneous diagonalization), the map $\lambda \mapsto \mu$ that sends a continuous eigenvalue to its discrete image is

\mu = \frac{1 + \tfrac{\stepsize}{2}\lambda}{1 - \tfrac{\stepsize}{2}\lambda}.

This is a Möbius transformation (a.k.a. fractional linear transformation) of the complex plane: $z \mapsto (az+b)/(cz+d)$ with $ad - bc \ne 0$ . Möbius transformations are conformal — angle-preserving — and they map circles-or-lines to circles-or-lines. The specific Möbius map above sends the imaginary axis exactly to the unit circle and sends the open left half-plane exactly to the open unit disk.

Proposition 4.3.

(Bilinear preserves Lyapunov stability.) If every eigenvalue $\lambda_i$ of $\statemat$ has $\operatorname{Re}(\lambda_i) < 0$ and $\stepsize > 0$ , then every eigenvalue $\mu_i$ of $\discA = (I - \tfrac{\stepsize}{2}\statemat)^{-1}(I + \tfrac{\stepsize}{2}\statemat)$ satisfies $\abs{\mu_i} < 1$ . Bilinear is therefore A-stable.

The proof is geometric: $z := \tfrac{\stepsize}{2}\lambda_i$ has $\operatorname{Re}(z) < 0$ , and $\abs{1+z} < \abs{1-z}$ iff $\operatorname{Re}(z) < 0$ (square both sides and cancel). Algebraically, $\abs{(1+z)/(1-z)}^2 = \abs{1+z}^2/\abs{1-z}^2 < 1$ , which is exactly $\abs{\mu_i} < 1$ . The full proof, with the Möbius geometry spelled out, is Exercise 4.5 in §4.9.

Bilinear is second-order accurate for smooth forcing — the trapezoidal-rule local truncation error is $O(\stepsize^3)$ , giving global error $O(\stepsize^2)$ . Unlike ZOH, it is not exact even on autonomous systems: the $(I - \tfrac{\stepsize}{2}\statemat)^{-1}(I + \tfrac{\stepsize}{2}\statemat)$ Padé approximation of $e^{\statemat \stepsize}$ agrees with the true exponential only through second order. The trade is one of structure. Both schemes send purely imaginary eigenvalues onto the unit circle: for $\lambda = i\omega$ , ZOH gives the discrete eigenvalue $e^{i\omega\stepsize}$ of modulus $\abs{e^{i\omega\stepsize}} = 1$ (ZOH is autonomous-exact, so it reproduces the oscillation outright), and the bilinear Möbius image likewise has modulus 1 — neither damps a marginally stable mode. They differ in how they wrap the imaginary axis onto the circle: ZOH’s map $\omega \mapsto e^{i\omega\stepsize}$ is $2\pi/\stepsize$ -periodic, so frequencies above the Nyquist rate $\pi/\stepsize$ alias onto lower ones, whereas the bilinear map is injective on the imaginary axis — it never aliases, at the cost of warping the frequency scale (compressing large $\omega$ toward $\mu = -1$ ). Bilinear is therefore the natural choice when a system has fast oscillatory modes you need to keep distinct; preserving oscillatory structure is the thread the symplectic methods of Chapter 6 take up in earnest.

4.5 Exponential-family discretizations

The third family takes a different approach: rather than approximating $e^{\statemat \stepsize}$ by a low-order Padé form (as bilinear does) or by the identity-plus-linear truncation $I + \stepsize \statemat$ (as forward Euler does), it computes $e^{\statemat \stepsize}$ exactly and approximates only the forcing integral. This is the philosophy of exponential integrators Hochbruck & Ostermann (2010) .

The exact one-step formula from variation of parameters (Chapter 1, §1.2) is

\statevec(t_{k+1}) = e^{\statemat \stepsize} \statevec_k + \int_0^{\stepsize} e^{\statemat (\stepsize - s)} \inputmat u(t_k + s) \, ds.

ZOH approximates $u(t_k + s) \approx u_k$ on $[0, \stepsize]$ ; bilinear approximates the integral by the trapezoidal rule with a Padé-approximated exponential. The exponential-trapezoidal scheme keeps the exact exponential and uses a linear interpolation of the input: $u(t_k + s) \approx u_k + (s/\stepsize)(u_{k+1} - u_k)$ . Substituting and evaluating the resulting two integrals gives

\statevec_{k+1} = e^{\statemat \stepsize} \statevec_k + \stepsize \, \varphi_1(\statemat \stepsize) \, \inputmat \, u_k + \stepsize \, \varphi_2(\statemat \stepsize) \, \inputmat \, (u_{k+1} - u_k),

where $\varphi_1$ and $\varphi_2$ are the first two $\varphi$ -functions of the exponential family:

\varphi_1(z) = \frac{e^z - 1}{z}, \qquad \varphi_2(z) = \frac{e^z - 1 - z}{z^2}.

Both are entire functions (the apparent singularities at $z = 0$ are removable: $\varphi_1(0) = 1$ , $\varphi_2(0) = 1/2$ ). Applied to the matrix $\statemat \stepsize$ , they produce matrix-valued objects $\varphi_1(\statemat \stepsize), \varphi_2(\statemat \stepsize)$ that can be computed via the same augmented-matrix-exponential trick as ZOH.

Reading off the discrete matrices in the form $\statevec_{k+1} = \discA \statevec_k + \discB_0 u_k + \discB_1 (u_{k+1} - u_k)$ :

\discA = e^{\statemat \stepsize}, \qquad \discB_0 = \stepsize \, \varphi_1(\statemat \stepsize) \, \inputmat, \qquad \discB_1 = \stepsize \, \varphi_2(\statemat \stepsize) \, \inputmat.

Notice that $\discA$ is the same as ZOH, so exp-trapezoidal preserves stability identically to ZOH: A-stable, exact for autonomous systems, eigenvalues in the unit disk for any $\stepsize > 0$ . The improvement is in the order of accuracy: by interpolating $u$ linearly rather than holding it constant, exp-trapezoidal becomes second-order accurate (provided $u$ is $C^2$ — the linear-interpolation error is governed by $u''$ ).

Exp-trapezoidal is the discretization of choice when the continuous dynamics are stiff — when $\statemat$ has eigenvalues with very different magnitudes, so the linear part is the hard part of the problem. By treating $e^{\statemat \stepsize}$ exactly the scheme sidesteps the step-size restriction that explicit methods inherit from the stiff eigenvalues, and the matrix exponential is computed once per layer in practice. Chapter 10 returns to this story for Mamba-3, which adopts a scheme from this exponential-trapezoidal family precisely because the input-dependent $\statemat(u_t)$ in selective SSMs produces stiff dynamics that ZOH and bilinear handle poorly.

A subtle implementation point: $\varphi_2(\statemat \stepsize)$ is not numerically equal to $(e^{\statemat \stepsize} - I - \statemat \stepsize)/(\statemat \stepsize)^2$ when this formula is computed naively, because catastrophic cancellation in the numerator destroys precision for small $\stepsize$ . The standard remedy is to compute all $\varphi$ -functions simultaneously from one augmented matrix exponential Al-Mohy & Higham (2011) , again using the trick that produced $\discB$ for ZOH. Both companions (exp_trapezoidal.py in JAX and discretization_atlas.jl in Julia) implement the augmented form.

Log-log error vs step size for exp-trapezoidal, ZOH, and bilinear on a forced damped oscillator. — Exp-trapezoidal (rose) achieves the same slope-2 convergence as bilinear (gold) while inheriting ZOH's autonomous-exactness — the best of both. ZOH (navy) lags at slope 1. Empirical fit on the two finest step sizes confirms the theoretical orders. Produced by companions/ch04/jax/exp_trapezoidal.py.

4.6 Order of accuracy and the discretization hierarchy

The three schemes are summarized in the table below. “Autonomous-exact” means the scheme produces zero error for the homogeneous problem ( $u \equiv 0$ ). “A-stable” means $\rho(\discA) \le 1$ for every $\stepsize > 0$ on every stable LTI test problem.

| Scheme | $\discA$ | Order | Autonomous-exact | A-stable | |---|---|---|---|---| | Forward Euler | $I + \stepsize \statemat$ | 1 | No | No | | ZOH | $e^{\statemat \stepsize}$ | 1 | Yes | Yes | | Bilinear (Tustin) | $(I - \tfrac{\stepsize}{2}\statemat)^{-1}(I + \tfrac{\stepsize}{2}\statemat)$ | 2 | No | Yes | | Exp-trapezoidal | $e^{\statemat \stepsize}$ | 2 | Yes | Yes |

The empirical order can be read directly off a log-log plot of error against step size: a $p$ -th-order scheme has error $\propto \stepsize^p$ , so on a log-log plot the data lie on a line of slope $p$ . The Julia companion discretization_atlas.jl runs this sweep against a high-accuracy Tsit5 reference solution from DifferentialEquations.jl; the JAX companion discretization_comparison.py performs the same sweep using scipy.integrate.solve_ivp (Radau) as the reference.

Log-log plot of max error versus step size for ZOH (slope ~1), bilinear and exp-trapezoidal (slope ~2) on a forced damped oscillator. — Empirical order of accuracy of the three schemes on the forced damped oscillator $\\ddot q + 0.5 \\dot q + 4 q = \\sin(2t)$. ZOH (navy): slope 1 on the log-log plot, confirming first-order convergence. Bilinear (gold) and exp-trapezoidal (rose): slope 2, confirming second-order convergence. Produced by companions/ch04/jax/discretization_comparison.py.

The hierarchy is not strict: ZOH’s autonomous-exactness can outweigh its first-order accuracy when the forcing is mild relative to the homogeneous decay. Bilinear’s exact imaginary-axis preservation can outweigh its non-exactness on autonomous problems when the system has long-lived oscillations. Exp-trapezoidal combines the best of both — exact on autonomous, second-order on forced — at the cost of computing more $\varphi$ -functions per step. The trade-offs are recurring themes throughout the SSM literature: the original S4 discretized with bilinear Gu et al. (2022) ; S4D supports either rule and finds the choice empirically negligible Gu et al. (2022) ; S5 and the Mamba line settled on ZOH for its simplicity and exactness on the (autonomous) drift; Mamba-3 switches to exp-trapezoidal once the input-dependence makes the dynamics stiff enough that the second-order error compounding matters.

4.7 What’s next

This chapter introduced three schemes and proved each preserves Lyapunov stability. Chapter 5 zooms in on the stability region of each scheme — the set of $\statemat \stepsize$ values for which $\rho(\discA) \le 1$ — and develops the Butcher-tableau machinery for systematically constructing higher-order Runge–Kutta methods. Chapter 6 then asks what to do when even A-stability is not enough — when the system is stiff or Hamiltonian and you need L-stable implicit methods, or symplectic methods that preserve geometric invariants. The exp-trapezoidal scheme of §4.5 turns out to be a member of a much larger family of exponential integrators, all of which deserve a place in your numerical-analysis toolkit.

Forward-looking SSM connections: ZOH is this book’s default in Chapters 7–9, and natively the choice of S5 and Mamba-1/2; the original S4 and S4D papers used bilinear, with the S4D ablations finding the two interchangeable. Mamba-3 uses a scheme from this exponential-trapezoidal family, treated in Chapter 10 — §10.2 derives the specific trapezoidal-quadrature variant it adopts, a second-order cousin of the $\varphi$ -interpolant form above.

4.8 Exercises

Six problems mixing computation and theory. Short/numerical (4.1–4.3) have inline collapsible solutions; long/proof exercises (4.4–4.6) have full worked solutions in §4.9.

Exercise 4.1 (computation)

Compute $\discA$ for ZOH on the 1-D test problem $\statemat = -2$ , $\inputmat = 1$ , $\stepsize = 0.1$ . Then verify by hand that $\abs{\discA} < 1$ .

Solution

$\discA = e^{-2 \cdot 0.1} = e^{-0.2} \approx 0.8187$ . The discrete eigenvalue is $0.8187$ , strictly inside the unit disk. The continuous decay rate is $2$ per unit time; the discrete decay factor per step is $e^{-0.2} \approx 0.8187$ , so after 10 steps (one continuous-time unit) the state is reduced by a factor $0.8187^{10} \approx 0.135 = e^{-2}$ — matching the continuous decay exactly. ZOH is autonomous-exact.

Exercise 4.2 (computation)

Compute $\discA$ for the bilinear transform on the same 1-D problem ( $\statemat = -2$ , $\stepsize = 0.1$ ). Compare $\abs{\discA}_{\text{bilinear}}$ against $\abs{\discA}_{\text{ZOH}} = e^{-0.2}$ — which is smaller, and why?

Solution

$\discA_{\text{bilinear}} = (1 + (0.1/2)(-2))/(1 - (0.1/2)(-2)) = 0.9/1.1 \approx 0.8182$ . ZOH gives $e^{-0.2} \approx 0.8187$ . The bilinear value is slightly smaller, meaning bilinear overdamps slightly relative to the true exponential. This is the second-order Padé approximation underestimating $e^{-0.2}$ : bilinear’s $\discA$ is the $(1,1)$ -Padé approximant $(1 + z/2)/(1 - z/2)$ of $e^z$ (the §4.4 margin note), which matches $e^z$ through $z^2$ but carries $z^3/4$ at third order against the exact $z^3/6$ , so the leading gap is $z^3/12 \approx -6.7 \times 10^{-4}$ at $z = -0.2$ . For small $\stepsize \cdot \abs{\lambda}$ the gap is $O(\stepsize^3)$ ; here the net difference (after the higher-order terms) is about $5 \cdot 10^{-4}$ .

Exercise 4.3 (computation + code)

Run companions/ch04/jax/discretization_comparison.py and verify empirically that ZOH has slope $\approx 1$ on the log-log error-vs- $\stepsize$ plot, bilinear and exp-trapezoidal have slope $\approx 2$ . What happens to the ZOH slope if you set the forcing $u(t) \equiv 0$ (autonomous case)?

Solution

The companion’s error_sweep function prints slopes near 1.00, 2.00, 2.00. Setting $u \equiv 0$ in the autonomous case makes ZOH exact — the error drops to roundoff ( $\sim 10^{-14}$ in float64) and the slope is meaningless. This is consistent with the autonomous-exactness property of §4.3: ZOH’s only error source for forced systems is the piecewise-constant approximation of $u$ .

Exercise 4.4 (theory) — solution in §4.9

Prove the augmented matrix exponential trick used in §4.3: that

\exp\!\begin{pmatrix} M_1 & M_2 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} e^{M_1} & M_1^{-1}(e^{M_1} - I)\, M_2 \\ 0 & I \end{pmatrix}

for invertible $M_1$ , by computing the series term-by-term.

Exercise 4.5 (theory) — solution in §4.9

Prove that the Möbius transformation $z \mapsto (1 + z/2)/(1 - z/2)$ sends the open left half-plane $\set{z : \operatorname{Re}(z) < 0}$ bijectively to the open unit disk $\set{w : \abs{w} < 1}$ , and sends the imaginary axis bijectively to the unit circle (minus the point $w = -1$ ).

Exercise 4.6 (theory) — solution in §4.9

Derive the exponential-trapezoidal scheme of §4.5 from the variation-of-parameters formula by substituting the linear interpolation $u(t_k + s) = u_k + (s/\stepsize)(u_{k+1} - u_k)$ into the forcing integral and evaluating term-by-term using the $\varphi$ -function identities.

4.9 Full solutions to theory exercises

Solution to Exercise 4.4

The matrix $M := \begin{pmatrix} M_1 & M_2 \\ 0 & 0 \end{pmatrix}$ satisfies, for $k \ge 1$ ,

M^k = \begin{pmatrix} M_1^k & M_1^{k-1} M_2 \\ 0 & 0 \end{pmatrix}.

This is verified by induction: $M^1$ matches by construction, and $M^{k+1} = M \cdot M^k = \begin{pmatrix} M_1 & M_2 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} M_1^k & M_1^{k-1} M_2 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} M_1^{k+1} & M_1^k M_2 \\ 0 & 0 \end{pmatrix}$ as required. Now sum the matrix-exponential series:

e^M = I + \sum_{k=1}^{\infty} \frac{M^k}{k!} = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} + \begin{pmatrix} \sum_{k \ge 1} \tfrac{M_1^k}{k!} & \sum_{k \ge 1} \tfrac{M_1^{k-1}}{k!} M_2 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} e^{M_1} & S \, M_2 \\ 0 & I \end{pmatrix},

where $S := \sum_{k \ge 1} M_1^{k-1}/k! = M_1^{-1} \sum_{k \ge 1} M_1^k / k! = M_1^{-1}(e^{M_1} - I)$ , using invertibility of $M_1$ to extract the leading $M_1^{-1}$ . Plugging back gives the claimed result. ∎

This is exactly the identity used to compute $\discA$ and $\discB$ jointly from one expm call in §4.3: take $M_1 = \statemat \stepsize$ and $M_2 = \inputmat \stepsize$ .

Solution to Exercise 4.5

Let $T(z) := (1 + z/2)/(1 - z/2)$ . Imaginary axis to unit circle: for $z = i\omega$ with $\omega \in \R$ ,

\abs{T(i\omega)}^2 = \frac{\abs{1 + i\omega/2}^2}{\abs{1 - i\omega/2}^2} = \frac{1 + \omega^2/4}{1 + \omega^2/4} = 1,

so $T(i\omega)$ lies on the unit circle. The map $\omega \mapsto T(i\omega)$ traces the unit circle once as $\omega$ ranges over $\R$ , missing only the limit point $T(\infty) = -1$ .

Left half-plane to unit disk: write $z = x + iy$ with $x < 0$ . Then

\abs{T(z)}^2 = \frac{(1 + x/2)^2 + (y/2)^2}{(1 - x/2)^2 + (y/2)^2}.

Since $x < 0$ , $\abs{1 + x/2} < \abs{1 - x/2}$ (both for $\abs{x} < 2$ where the signs match the absolute values, and for $\abs{x} > 2$ where they reverse). The $y$ -terms are identical in numerator and denominator. Therefore the numerator is strictly less than the denominator, so $\abs{T(z)} < 1$ .

Bijectivity: Möbius transformations $z \mapsto (az + b)/(cz + d)$ with $ad - bc \ne 0$ are bijections of the Riemann sphere $\C \cup \set{\infty}$ to itself. Restriction to the half-plane gives a bijection to the disk. The inverse is $w \mapsto (2(w-1))/(w+1)$ , which can be verified by direct computation. ∎

This Möbius geometry is the algebraic content of A-stability for bilinear: a scheme is A-stable iff its eigenvalue map sends the open left half-plane into the closed unit disk. Bilinear achieves this bijectively; ZOH does so as well but non-bijectively (it folds an infinite strip of width $2\pi/\stepsize$ onto the unit disk, aliasing high-frequency continuous modes onto low-frequency discrete ones).

Solution to Exercise 4.6

Start from variation of parameters on $[t_k, t_{k+1}]$ :

\statevec_{k+1} = e^{\statemat \stepsize} \statevec_k + \int_0^{\stepsize} e^{\statemat(\stepsize - s)} \inputmat \, u(t_k + s) \, ds.

Substitute the linear interpolant $u(t_k + s) = u_k + (s/\stepsize)(u_{k+1} - u_k)$ for $s \in [0, \stepsize]$ :

\int_0^{\stepsize} e^{\statemat(\stepsize - s)} \inputmat \left[u_k + \frac{s}{\stepsize}(u_{k+1} - u_k)\right] ds = I_0 \, u_k + I_1 \, (u_{k+1} - u_k),

with $I_0 := \int_0^{\stepsize} e^{\statemat (\stepsize - s)} \inputmat \, ds$ and $I_1 := \int_0^{\stepsize} \tfrac{s}{\stepsize} e^{\statemat (\stepsize - s)} \inputmat \, ds$ . Change variables $\sigma = \stepsize - s$ in $I_0$ :

I_0 = \int_0^{\stepsize} e^{\statemat \sigma} \, d\sigma \cdot \inputmat = \statemat^{-1}(e^{\statemat \stepsize} - I) \, \inputmat = \stepsize \, \varphi_1(\statemat \stepsize) \, \inputmat,

where in the last step we substituted the definition $\varphi_1(z) = (e^z - 1)/z$ applied with argument $\statemat \stepsize$ . For $I_1$ , change variables again $\sigma = \stepsize - s$ , so $s = \stepsize - \sigma$ and $s/\stepsize = 1 - \sigma/\stepsize$ :

I_1 = \int_0^{\stepsize} \left(1 - \frac{\sigma}{\stepsize}\right) e^{\statemat \sigma} d\sigma \cdot \inputmat = \frac{1}{\stepsize}\left[\stepsize \, I_0 - \int_0^{\stepsize} \sigma \, e^{\statemat \sigma} d\sigma \cdot \inputmat\right].

The remaining $\sigma$ -weighted integral evaluates via integration by parts to $\statemat^{-2}(e^{\statemat \stepsize}(\statemat \stepsize - I) + I) \inputmat = \stepsize^2 \varphi_2(\statemat \stepsize) \inputmat$ , using the $\varphi_2(z) = (e^z - 1 - z)/z^2$ identity. Carefully tracking the algebra gives $I_1 = \stepsize \, \varphi_2(\statemat \stepsize) \, \inputmat$ as claimed. Combining,

\statevec_{k+1} = e^{\statemat \stepsize} \statevec_k + \stepsize \, \varphi_1(\statemat \stepsize) \, \inputmat \, u_k + \stepsize \, \varphi_2(\statemat \stepsize) \, \inputmat \, (u_{k+1} - u_k). \qquad \square

The local truncation error is $O(\stepsize^3)$ for $C^2$ inputs (the linear-interpolation error is $O(\stepsize^2)$ — governed by $u''$ — over an interval of length $\stepsize$ , and the integral picks up an additional $\stepsize$ factor). Global error is therefore $O(\stepsize^2)$ — second-order, as advertised.

4.10 Companion code

Two JAX companions, two PyTorch companions, and one Julia companion for Chapter 4:

JAX (companions/ch04/jax/):

discretization_comparison.py — implements ZOH, bilinear, and forward Euler in JAX; runs the error-vs-step-size sweep on a forced damped oscillator; emits eigenvalue_migration.png and order_convergence.png.
exp_trapezoidal.py — implements the exp-trapezoidal scheme via the augmented matrix exponential; verifies second-order convergence against the same forced oscillator; emits exp_trap_convergence.png.

Julia (companions/ch04/julia/):

discretization_atlas.jl — ports the post_transformers Week-9 reference implementation to ssm-foundations; uses DifferentialEquations.jl Tsit5 as the ground-truth reference solver and reports the empirical slope per scheme.
Project.toml / Manifest.toml — companion-local Julia environment pinning DifferentialEquations and LinearAlgebra.

PyTorch (companions/ch04/torch/):

discretization_comparison.py — the forward-Euler / ZOH / bilinear discretizers and the error-vs-step sweep (compute-only; the JAX companion produces the figures).
exp_trapezoidal.py — the exp-trapezoidal scheme via the augmented matrix exponential, with ZOH and bilinear baselines.
tests/ — cross-framework parity: the torch discretizers and trajectories equal their JAX counterparts to within $10^{-9}$ (float64).

To run from the repo root:

# JAX (uses post_transformers .venv with scipy, numpy, matplotlib, jax pre-installed)
PYTHONPATH=. python companions/ch04/jax/discretization_comparison.py
PYTHONPATH=. python companions/ch04/jax/exp_trapezoidal.py

# PyTorch (needs the .venv [torch] extra; parity only, no figures)
PYTHONPATH=. python companions/ch04/torch/discretization_comparison.py
PYTHONPATH=. python companions/ch04/torch/exp_trapezoidal.py

# Julia (first run will precompile DifferentialEquations.jl; ~2–5 minutes)
julia --project=companions/ch04/julia companions/ch04/julia/discretization_atlas.jl

All figures emit to public/figures/ch04/.

Part I · Foundations Week 5 Published

Stability regions, Butcher tableau, and order conditions

Runge–Kutta theory, the Butcher tableau formalism, order conditions, stability regions in the complex plane, A-stability and L-stability, and the Dahlquist barrier — applied to the Chapter 4 discretization atlas.

Stability regions, Butcher tableau, and order conditions

Chapter 5 — at a glance

Goal: develop the systematic machinery for constructing and analyzing one-step integration schemes. The Butcher tableau is a compact bookkeeping device that parameterizes every Runge–Kutta method — explicit, implicit, embedded, symplectic — by a triple of arrays $(A, b, c)$ . Order conditions are the algebraic equations a tableau must satisfy to have a given convergence order. The stability function $\stabfn(z)$ summarizes how a method behaves on the scalar test problem $\dot y = \lambda y$ , and its stability region $\set{z \in \C : \abs{\stabfn(z)} \le 1}$ encodes the geometry of A-stability. By the end you can read the tableau of an arbitrary method, identify its order, plot its stability region, and decide whether it is suitable for a given problem.

Reading time: ~45 minutes prose; 60–90 minutes with the JAX and Julia companions.

Key insight: the choice of discretization for an SSM is a choice of point in discretization-design space. Bilinear sits at the trapezoidal-rule corner of the Runge–Kutta tableau interior (a rational stability function); ZOH and exp-trapezoidal sit just outside that interior, in the exponential-integrator region whose stability functions are neither polynomial nor rational (§5.4); the Mamba-3 paper Lahoti et al. (2026) effectively shops this space for points that combine A-stability with second-order accuracy. The Dahlquist barrier (§5.4) says you cannot have everything at once: no explicit Runge–Kutta method is A-stable at all — at any order. This is the structural reason the next chapter switches to implicit methods.

5.1 Runge–Kutta methods and the Butcher tableau

Chapter 4 introduced ZOH, bilinear, and exp-trapezoidal as three specific one-step schemes for the linear state-space system. They are members of a much larger family: one-step methods that compute $\statevec_{k+1}$ from $\statevec_k$ using only one previous state and possibly intermediate stage evaluations of the right-hand side. The Runge–Kutta (RK) family is the most studied subfamily.

For the autonomous initial-value problem $\dot \statevec = f(\statevec)$ — we drop the explicit input dependence for this chapter; everything generalizes — an $s$ -stage Runge–Kutta method computes

\begin{aligned} k_i &= f\!\left(\statevec_k + \stepsize \sum_{j=1}^{s} a_{ij} k_j\right), \qquad i = 1, \ldots, s, \\ \statevec_{k+1} &= \statevec_k + \stepsize \sum_{i=1}^{s} b_i k_i. \end{aligned}

The $k_i$ are the stage derivatives — evaluations of $f$ at intermediate states. The constants $(a_{ij}, b_i, c_i)$ define the method, where $c_i := \sum_j a_{ij}$ are the stage times. The Butcher tableau is the standard tabular notation

\begin{array}{c|c} c & A \\ \hline & b^\top \end{array} \qquad \text{with } A = (a_{ij}) \in \R^{s \times s}, \ b \in \R^s, \ c \in \R^s.

The method is explicit if $A$ is strictly lower triangular ( $a_{ij} = 0$ for $j \ge i$ ), in which case each $k_i$ depends only on previously computed $k_1, \ldots, k_{i-1}$ . It is implicit otherwise, requiring a nonlinear system solve at each step.

Three classical examples make the formalism concrete.

Forward Euler. The simplest explicit method, $s = 1$ :

\begin{array}{c|c} 0 & 0 \\ \hline & 1 \end{array} \qquad \Longrightarrow \qquad \statevec_{k+1} = \statevec_k + \stepsize f(\statevec_k).

The midpoint rule (RK2). A two-stage explicit method:

\begin{array}{c|cc} 0 & 0 & 0 \\ 1/2 & 1/2 & 0 \\ \hline & 0 & 1 \end{array} \qquad \Longrightarrow \qquad \begin{aligned} k_1 &= f(\statevec_k), \\ k_2 &= f(\statevec_k + \tfrac{\stepsize}{2} k_1), \\ \statevec_{k+1} &= \statevec_k + \stepsize k_2. \end{aligned}

Classical RK4. The most-quoted explicit method, four stages:

\begin{array}{c|cccc} 0 & 0 & 0 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ \hline & 1/6 & 1/3 & 1/3 & 1/6 \end{array}

Fourth-order globally, the workhorse of physics simulations for half a century.

The tableau abstraction was introduced by John Butcher in the 1960s and is now the lingua franca of one-step methods: every reference paper on numerical integration presents its method as a Butcher tableau. The constraint $\sum_j a_{ij} = c_i$ (consistency of stage times) and $\sum_i b_i = 1$ (consistency of the weighted sum) are imposed by every tableau worth its name. Without them the method does not even reproduce the trivial constant trajectory $\statevec(t) = \statevec_0$ when $f \equiv 0$ .

5.2 Order conditions

How do we know that classical RK4 is fourth-order globally? The convergence order of an RK method is determined by order conditions: polynomial identities on the entries of $(A, b, c)$ that must hold for the local truncation error to be $O(\stepsize^{p+1})$ .

The derivation expands $\statevec(t_k + \stepsize)$ as a Taylor series and matches term-by-term against the RK output $\statevec_{k+1}$ produced from $\statevec(t_k)$ . For a scalar autonomous ODE the leading conditions are:

Order 1: $\sum_i b_i = 1$
Order 2: also $\sum_i b_i c_i = 1/2$
Order 3: also $\sum_i b_i c_i^2 = 1/3$ and $\sum_{i,j} b_i a_{ij} c_j = 1/6$
Order 4: four more conditions involving products of $b_i, c_i, a_{ij}$

For systems (the case we actually care about), the number of order conditions grows combinatorially: the cumulative count through order $p$ is 1, 2, 4, 8, 17, 37 for $p = 1, \ldots, 6$ — one condition per Butcher tree of order $\le p$ Hairer et al. (1993) . Butcher organized these conditions using a bijection between order-conditions and rooted trees (each tree corresponds to an integer partition of $f$ -evaluations and gives one polynomial identity). The combinatorial explosion is one reason 8th-order RK methods are rarely used: there are over 200 order conditions to satisfy simultaneously, and the resulting tableaux have dozens of stages.

Theorem 5.1.

(Butcher order conditions.) Suppose $f$ is sufficiently smooth. An $s$ -stage Runge–Kutta method with tableau $(A, b, c)$ is of order $p$ — meaning $\statevec_{k+1} - \statevec(t_{k+1}) = O(\stepsize^{p+1})$ — if and only if a specific finite set of polynomial conditions on $(A, b, c)$ holds (one per Butcher tree of order $\le p$ ). The Lax equivalence theorem (Chapter 4) then guarantees global convergence at order $p$ .

The proof is non-trivial and we will not reproduce it; the canonical reference is Hairer–Nørsett–Wanner Chapter II.2 Hairer et al. (1993) . For our purposes the takeaway is that the order of an RK method is a property of its tableau entries, derivable by algebraic manipulation alone. You can verify (Exercise 5.1) that RK4’s tableau satisfies the order conditions through order 3; you can also verify (Exercise 5.2) that an arbitrary two-stage explicit tableau cannot achieve order higher than 2, regardless of how you choose the free parameters.

Butcher tableau matrix visualizations for forward Euler, midpoint RK2, classical RK4, and Runge-Kutta-Fehlberg 4(5). — Butcher tableau matrices for four canonical methods, visualized as heatmaps of the (A | b, c) augmented matrix. Forward Euler (1×1, trivial). Midpoint RK2 (2×2, single off-diagonal). Classical RK4 (4×4, characteristic strictly-lower-triangular A with the $b = (1/6, 1/3, 1/3, 1/6)$ weights). RKF45 (6×6, dense lower-triangular). The sparsity pattern of A is what distinguishes explicit from implicit methods. Produced by companions/ch05/jax/stability_regions.py.

Log-log error vs step size for forward Euler (slope 1), midpoint RK2 (slope 2), and classical RK4 (slope 4) on a forced damped oscillator. — Empirical verification of the Butcher order conditions (§5.2). Each method's local truncation error decreases as $\\Delta^{p+1}$, giving global error $\\Delta^p$ — a $p$-th order method has slope $p$ on the log-log plot. RK1, RK2, and RK4 all match their theoretical orders to within ~5% on this forced linear oscillator. Produced by companions/ch05/jax/order_verification.py.

5.3 The stability function and stability regions

Order conditions characterize how a method behaves as $\stepsize \to 0$ on a smooth problem. Stability characterizes how it behaves at fixed $\stepsize$ on a stiff problem — when the time scales in the dynamics are well-separated and at least one is much faster than $\stepsize$ .

The standard probe is the Dahlquist test equation

\dot y(t) = \lambda y(t), \qquad y(0) = 1, \qquad \lambda \in \C.

The exact solution is $y(t) = e^{\lambda t}$ . Applying an $s$ -stage RK method to this scalar problem produces, after one step,

y_{k+1} = \stabfn(\lambda \stepsize) \, y_k,

where $\stabfn : \C \to \C$ is the stability function of the method. For an explicit RK method, $\stabfn(z)$ is a polynomial in $z$ of degree $\le s$ :

\stabfn_{\text{explicit}}(z) = 1 + z b^\top (I - z A)^{-1} \mathbf{1},

where $\mathbf{1}$ is the all-ones $s$ -vector and the inverse expands as a finite geometric series because $A$ is nilpotent (strictly lower-triangular). For an implicit RK method, $\stabfn(z)$ is a rational function $P(z)/Q(z)$ .

Three examples make this concrete.

Forward Euler. With $A = 0$ , $b = 1$ : $\stabfn(z) = 1 + z$ .

Midpoint RK2. Working through the formula gives $\stabfn(z) = 1 + z + z^2/2$ — the second-order Taylor polynomial of $e^z$ .

Classical RK4. $\stabfn(z) = 1 + z + z^2/2 + z^3/6 + z^4/24$ — the fourth-order Taylor polynomial.

A pattern emerges: for a $p$ -th order explicit RK method, $\stabfn(z)$ agrees with $e^z$ through terms of degree $p$ . This is no coincidence — order $p$ requires the method to reproduce $e^z$ correctly on the test problem through that order in $\stepsize$ . The error $\stabfn(z) - e^z$ is $O(z^{p+1})$ near $z = 0$ .

The stability region of the method is the set

\stabregion = \set{z \in \C : \abs{\stabfn(z)} \le 1}.

On the test problem with $\lambda$ in the left half-plane (so the exact solution decays), the method’s iterates $y_k = \stabfn(\lambda \stepsize)^k$ stay bounded iff $\lambda \stepsize \in \stabregion$ . The stability region is the constraint on $\lambda \stepsize$ that keeps the method well-behaved on stiff problems — equivalently, the set of $\stepsize$ values you can take for a given $\lambda$ .

Stability regions of forward Euler (disk), midpoint RK2 (lobe), and classical RK4 (kidney) in the complex plane. — Stability regions of three explicit RK methods. Forward Euler: disk of radius 1 centered at $z = -1$. Midpoint RK2: oval extending further into the left half-plane. Classical RK4: kidney-shaped region that includes a portion of the imaginary axis (relevant for oscillatory problems). None of these regions contains the entire left half-plane — explicit RK methods are never A-stable. Produced by companions/ch05/jax/stability_regions.py.

The stability region of forward Euler is the disk $\abs{1 + z} \le 1$ — a disk of radius 1 centered at $z = -1$ . For a real-eigenvalue $\lambda < 0$ , this requires $\stepsize \le 2/\abs{\lambda}$ . The tighter this constraint becomes (large $\abs{\lambda}$ = stiff problem), the smaller the maximum step size — this is the step-size restriction that motivates implicit methods.

5.4 A-stability, L-stability, and the Dahlquist barrier

A method is A-stable if its stability region contains the entire closed left half-plane: $\stabregion \supseteq \set{z \in \C : \operatorname{Re}(z) \le 0}$ . On a stable linear problem with $\operatorname{Re}(\lambda) \le 0$ , an A-stable method preserves stability for every positive step size — no step-size restriction at all. This is the gold standard for stiff problems.

A method is L-stable if it is A-stable and $\stabfn(z) \to 0$ as $\operatorname{Re}(z) \to -\infty$ . L-stability damps fast eigenmodes aggressively, which is desirable when you want the discrete method to not preserve fast oscillations the continuous solution decays. Backward Euler is L-stable; the trapezoidal rule (= bilinear from Chapter 4) is A-stable but not L-stable (its stability function $\stabfn(z) = (1 + z/2)/(1 - z/2)$ tends to $-1$ , not 0, as $\operatorname{Re}(z) \to -\infty$ ). The distinction matters when the system has eigenvalues with very large negative real part: bilinear will preserve high-frequency oscillation modes that should physically be damped.

The fundamental theorem on A-stability of explicit methods is the Dahlquist barrier:

Theorem 5.2.

(Dahlquist barrier for explicit RK.) No explicit Runge–Kutta method is A-stable. The maximum order of an A-stable Runge–Kutta method (explicit or implicit) is unbounded, but explicit methods cannot achieve A-stability at any order.

The proof for the explicit case is direct: an explicit RK method has $\stabfn(z) =$ polynomial in $z$ , and any polynomial diverges as $\abs{z} \to \infty$ . So $\stabregion$ is bounded, and cannot contain the entire (unbounded) left half-plane. ∎

The implicit case is more subtle: there exists an A-stable Runge–Kutta method of every order, but the constructions are increasingly elaborate. The Gauss–Legendre family — Chapter 6’s main attraction — produces A-stable implicit methods of every even order. The $s$ -stage Gauss–Legendre method is order $2s$ , optimally accurate among $s$ -stage implicit RK methods, and A-stable, and (a bonus for Hamiltonian systems) symplectic.

The practical consequence for SSM discretization: by Theorem 5.2, explicit and A-stable are mutually exclusive — at any order — so A-stability forces you out of the explicit-RK family, into an implicit method or a non-polynomial (exponential) stability function. (The separate order- $\le 2$ ceiling — “A-stable and order $> 2$ is impossible” — is the second Dahlquist barrier, and it binds A-stable linear multistep methods, not RK; Chapter 6 §6.2 states it. The two are easy to conflate: explicit RK fails A-stability outright, regardless of order.) The Chapter 4 schemes all give up explicitness:

Forward Euler: explicit, order 1, not A-stable.
ZOH and exp-trapezoidal: A-stable and of arbitrary order on autonomous problems, by virtue of using the matrix exponential — but they sit outside the explicit-RK family (they have non-polynomial stability functions).
Bilinear: A-stable, order 2, implicit in form (requires inverting $I - \stepsize \statemat / 2$ ).

The exponential integrators of §4.5 are a way to evade the Dahlquist barrier: by computing $e^{\statemat \stepsize}$ exactly, the stability function inherits the full LHP-stability of the exponential, and the polynomial-order restriction is moved from $\stabfn$ to a different object (the approximation of the forcing by $\varphi$ -functions). This is why exp-trapezoidal can be both “explicit in the input” (no nonlinear solve in $u$ ) and A-stable.

5.5 Stability regions of the Chapter 4 schemes

Reading the three Chapter 4 schemes through the lens of stability functions:

ZOH. From §4.3, $\discA = e^{\statemat \stepsize}$ , so for the scalar test problem $\stabfn_{\text{ZOH}}(z) = e^z$ . The stability region is $\set{z : \abs{e^z} \le 1} = \set{z : \operatorname{Re}(z) \le 0}$ — exactly the closed left half-plane. ZOH is A-stable, in fact also L-stable because $\abs{e^z} \to 0$ as $\operatorname{Re}(z) \to -\infty$ .

Bilinear (Tustin). From §4.4, $\stabfn_{\text{bilinear}}(z) = (1 + z/2)/(1 - z/2)$ . The Möbius geometry from Chapter 4 §4.4 / Exercise 4.5 says the LHP maps bijectively to the open unit disk, so the stability region is exactly the closed left half-plane — bilinear is A-stable. But $\stabfn_{\text{bilinear}}(z) \to -1$ as $\operatorname{Re}(z) \to -\infty$ , so bilinear is not L-stable. This is the trapezoidal-rule failure mode: oscillatory artifacts on very stiff problems.

Exp-trapezoidal. The homogeneous stability function is identical to ZOH’s: $\stabfn(z) = e^z$ on the test problem (the difference between ZOH and exp-trap shows up only in the forcing integrals; the homogeneous behavior is the same). So exp-trapezoidal is A-stable and L-stable, just like ZOH.

Stability regions of forward Euler, ZOH, bilinear, and exp-trapezoidal overlaid in the complex plane. — Stability regions of the schemes from Chapter 4 §4.6, with classical RK1 and RK4 included for reference. Forward Euler (RK1, navy) and classical RK4 (gold): bounded regions in the LHP. ZOH / exp-trapezoidal (rose) and bilinear (gray): the entire closed left half-plane — A-stable. Among the explicit-RK family, no method achieves A-stability (Dahlquist barrier); the schemes that do are the ones using $e^{A \\Delta}$ as the dynamics factor. Produced by companions/ch05/jax/stability_regions.py.

5.6 What’s next

This chapter has laid out the systematic framework for analyzing one-step methods: Butcher tableau, order conditions, stability functions, stability regions, A-stability and L-stability, and the Dahlquist barrier. Chapter 6 uses this framework to motivate the implicit family — backward Euler, BDF, and the Gauss–Legendre IRK schemes that achieve A-stability at every order — and the symplectic family, which trades order or A-stability for the preservation of geometric invariants. Chapter 6 is where the C1 niche pilot lives: symplectic integration of selective SSMs viewed as Hamiltonian flows on the hidden-state manifold.

A natural side path is embedded methods — pairs of RK tableaux of consecutive orders sharing stages, used for adaptive step-size control. The DP54 method (Dormand–Prince 5(4)) used by DifferentialEquations.jl’s default solver Tsit5 is a famous example. Embedded methods are excellent practical tools but tangential to the SSM pedagogy of this book; the companion discretization_atlas.jl references the RKF45 / DP54 tableaux for completeness.

5.7 Exercises

Six problems. Short/numerical (5.1–5.3) have inline collapsible solutions; long/proof exercises (5.4–5.6) have full worked solutions in §5.8.

Exercise 5.1 (computation)

Verify that classical RK4 satisfies the order conditions through order 3: $\sum b_i = 1$ (order 1), $\sum b_i c_i = 1/2$ (order 2), $\sum b_i c_i^2 = 1/3$ (order 3). Use the tableau in §5.1. (RK4’s fourth-order accuracy additionally requires the four order-4 conditions — one per order-4 rooted tree, e.g. $\sum_i b_i c_i^3 = 1/4$ — which make a good extension.)

Solution

From the tableau, $b = (1/6, 1/3, 1/3, 1/6)$ and $c = (0, 1/2, 1/2, 1)$ .

$\sum b_i = 1/6 + 1/3 + 1/3 + 1/6 = 2/6 + 4/6 = 1$ . ✓
$\sum b_i c_i = (1/6)(0) + (1/3)(1/2) + (1/3)(1/2) + (1/6)(1) = 0 + 1/6 + 1/6 + 1/6 = 1/2$ . ✓
$\sum b_i c_i^2 = (1/6)(0) + (1/3)(1/4) + (1/3)(1/4) + (1/6)(1) = 0 + 1/12 + 1/12 + 1/6 = 2/12 + 2/12 = 1/3$ . ✓

Exercise 5.2 (computation)

Show that an arbitrary two-stage explicit RK tableau (one free parameter $c_2 \in (0, 1]$ , with $a_{21} = c_2$ , $b_2$ to be determined, $b_1 = 1 - b_2$ ) cannot achieve order 3 no matter how $c_2, b_2$ are chosen.

Solution

For order 2 we need $b_1 + b_2 = 1$ (trivially $b_1 = 1 - b_2$ ) and $b_2 c_2 = 1/2$ , giving $b_2 = 1/(2c_2)$ , $b_1 = 1 - 1/(2c_2)$ . So the two-parameter family $(c_2, b_2)$ collapses to a one-parameter family of order-2 methods (midpoint $c_2 = 1/2$ , Heun $c_2 = 1$ , etc.).

For order 3 we additionally need $b_2 c_2^2 = 1/3$ and $b_2 a_{21} c_1 = 1/6$ . The second condition uses $c_1 = 0$ , giving $b_2 \cdot c_2 \cdot 0 = 0 \ne 1/6$ . Contradiction. So no two-stage explicit method achieves order 3.

This is a small instance of the general fact: order $p$ requires at least $p$ stages for explicit methods (Butcher’s order barriers).

Exercise 5.3 (computation + code)

Run companions/ch05/jax/order_verification.py. Confirm the empirical slopes match Forward Euler (1), midpoint RK2 (2), classical RK4 (4). What happens to the RK4 slope at very small $\stepsize$ (say $\stepsize = 10^{-4}$ )?

Solution

The companion’s order_verification function prints slopes $\approx 1.00$ , $\approx 2.00$ , $\approx 4.00$ for the three methods in turn. At very small $\stepsize$ , the RK4 error reaches the roundoff floor — typically $10^{-13}$ in float64 — and the slope flattens out as machine precision starts to dominate the method error. This is why empirical order is always estimated from the two finest dt where the method error still exceeds roundoff, not from arbitrarily fine dt.

Exercise 5.4 (theory) — solution in §5.8

Prove that for an explicit Runge–Kutta method with tableau $(A, b, c)$ , the stability function is

\stabfn(z) = 1 + z b^\top (I - z A)^{-1} \mathbf{1},

where $\mathbf{1} = (1, 1, \ldots, 1)^\top$ . Verify that for classical RK4 this evaluates to $1 + z + z^2/2 + z^3/6 + z^4/24$ .

Exercise 5.5 (theory) — solution in §5.8

Prove the Dahlquist-barrier statement for explicit RK methods: that no explicit RK method is A-stable. (The proof was sketched in §5.4; flesh it out, using the fact that explicit methods have polynomial stability functions.)

Exercise 5.6 (theory) — solution in §5.8

For the bilinear/Tustin scheme, show that the stability function $\stabfn(z) = (1 + z/2)/(1 - z/2)$ satisfies $\stabfn(z) \to -1$ as $\operatorname{Re}(z) \to -\infty$ along any direction with $\operatorname{Im}(z)$ bounded, and explain why this rules out L-stability. Discuss what this means for the discrete dynamics of a bilinear-discretized stable LTI system with very fast-decaying eigenmodes.

5.8 Full solutions to theory exercises

Solution to Exercise 5.4

Apply the $s$ -stage explicit RK method to the test ODE $\dot y = \lambda y$ , $y(0) = y_k = 1$ . The stage equations become

k_i = \lambda \left(1 + \stepsize \sum_{j<i} a_{ij} k_j\right) = \lambda + \lambda \stepsize \sum_{j<i} a_{ij} k_j.

Set $z := \lambda \stepsize$ and let $\kappa = (k_1, k_2, \ldots, k_s)^\top$ . The vector form is

\kappa = \lambda \mathbf{1} + z A \kappa \qquad \Longrightarrow \qquad (I - z A) \kappa = \lambda \mathbf{1}.

For an explicit method $A$ is strictly lower-triangular, hence nilpotent ( $A^s = 0$ ), and the geometric series $(I - z A)^{-1} = I + z A + z^2 A^2 + \cdots + z^{s-1} A^{s-1}$ truncates. So $\kappa = \lambda (I - z A)^{-1} \mathbf{1}$ , and

y_{k+1} = 1 + \stepsize b^\top \kappa = 1 + z \cdot b^\top (I - z A)^{-1} \mathbf{1} = \stabfn(z). \quad \square

For classical RK4, direct computation of $b^\top (I - z A)^{-1} \mathbf{1}$ — expanding the geometric series and multiplying out — gives, after some algebra,

\stabfn_{\text{RK4}}(z) = 1 + z + \frac{z^2}{2} + \frac{z^3}{6} + \frac{z^4}{24}.

This is the fourth-order Taylor polynomial of $e^z$ , as expected for a fourth-order method.

Solution to Exercise 5.5

For an explicit RK method, $A$ is strictly lower-triangular, so $A^s = 0$ and the geometric series for $(I - z A)^{-1}$ truncates at $z^{s-1}$ . Therefore

\stabfn(z) = 1 + z b^\top (I - z A)^{-1} \mathbf{1}

is a polynomial in $z$ of degree at most $s$ . Polynomials are unbounded: $\abs{\stabfn(z)} \to \infty$ as $\abs{z} \to \infty$ along any ray (assuming the leading coefficient is nonzero, which holds for any $s$ -stage method of order $\ge s$ , the only interesting case).

The closed left half-plane $\set{z : \operatorname{Re}(z) \le 0}$ is unbounded — it extends to $-\infty$ along the real axis. The stability region $\stabregion = \set{z : \abs{\stabfn(z)} \le 1}$ is therefore bounded. So $\stabregion$ cannot contain the entire LHP, and the method is not A-stable. ∎

The proof relies essentially on $\stabfn$ being a polynomial. Implicit methods have rational $\stabfn(z) = P(z)/Q(z)$ , which can be bounded at infinity if $\deg(Q) \ge \deg(P)$ . This is the precise sense in which implicit methods evade the Dahlquist barrier.

Solution to Exercise 5.6

Write $z = -r + i\omega$ with $r \to +\infty$ and $\omega$ bounded. Then

\stabfn(z) = \frac{1 + z/2}{1 - z/2} = \frac{1 + (-r/2 + i\omega/2)}{1 - (-r/2 + i\omega/2)} = \frac{1 - r/2 + i \omega/2}{1 + r/2 - i \omega/2}.

For large $r$ the dominant terms are $\stabfn(z) \approx (-r/2)/(r/2) = -1$ . More precisely,

\stabfn(z) = \frac{-r/2 + (1 + i\omega/2)}{r/2 + (1 - i\omega/2)} = -1 + O(1/r) \to -1 \quad \text{as } r \to \infty.

This rules out L-stability: L-stability requires $\stabfn \to 0$ , not $\to -1$ . ∎

Practical implication. For a stable LTI system $\dot \statevec = \statemat \statevec$ with $\statemat$ having an eigenvalue $\lambda$ with very large $\abs{\operatorname{Re}(\lambda)}$ — a fast-decaying mode — bilinear discretization at step size $\stepsize$ produces a discrete eigenvalue $\mu = (1 + \lambda \stepsize / 2)/(1 - \lambda \stepsize / 2) \approx -1$ . The corresponding mode in the discrete dynamics therefore alternates sign at every step before decaying, rather than smoothly decaying. This trapezoidal-rule oscillation is the classic L-stability failure mode: the discrete solution exhibits spurious oscillations not present in the continuous solution. For SSMs this manifests as oscillatory artifacts in the hidden state when the system has eigenvalues with very negative real parts — exactly the regime where stiff implicit methods like BDF (Chapter 6) are preferred over bilinear.

5.9 Companion code

Two JAX companions, two PyTorch companions, and one Julia companion for Chapter 5.

JAX (companions/ch05/jax/):

stability_regions.py — computes and plots stability regions for forward Euler, midpoint RK2, classical RK4, ZOH, bilinear, and exp-trapezoidal in the complex $z = \lambda \stepsize$ plane; emits rk_stability_regions.png, atlas_stability.png, and butcher_tableaux.png.
order_verification.py — implements RK1/RK2/RK4 from their tableaux; verifies empirical orders 1, 2, 4 on the forced damped oscillator from Chapter 4; emits order_verification.png.

Julia (companions/ch05/julia/):

butcher_tableau_zoo.jl — uses DifferentialEquations.jl solver tables (OrdinaryDiffEqTsit5, etc.) to extract Butcher coefficients programmatically and print them side-by-side with the analytic forms used in §5.1.
Project.toml / Manifest.toml — companion-local Julia environment.

PyTorch (companions/ch05/torch/):

stability_regions.py — the RK Butcher tableaux and stability functions $R(z)$ over the complex grid (compute-only; the JAX companion produces the figures).
order_verification.py — the tableau-driven RK1/RK2/RK4 integrator and empirical-slope routine.
tests/ — cross-framework parity: the torch $R(z)$ grids and RK trajectories equal their JAX counterparts to within $10^{-9}$ (float64).

To run from the repo root:

# JAX
PYTHONPATH=. python companions/ch05/jax/stability_regions.py
PYTHONPATH=. python companions/ch05/jax/order_verification.py

# PyTorch (needs the .venv [torch] extra; parity only, no figures)
PYTHONPATH=. python companions/ch05/torch/stability_regions.py
PYTHONPATH=. python companions/ch05/torch/order_verification.py

# Julia (first run will precompile DifferentialEquations.jl)
julia --project=companions/ch05/julia companions/ch05/julia/butcher_tableau_zoo.jl

All figures emit to public/figures/ch05/.

Part I · Foundations Week 6 Published

Implicit methods, stiff systems, symplectic integration

When explicit methods fail — backward Euler, BDF, DIRK as A-stable alternatives; exponential integrators as the family of choice for selective SSMs; Hamiltonian mechanics and symplectic integration as the natural geometric language for hidden-state dynamics.

Implicit methods, stiff systems, symplectic integration

Chapter 6 — at a glance

Goal: lay out the integrator families needed when the Chapter 4–5 explicit/A-stable trade-off becomes uncomfortable. Stiff systems force the step size of an explicit method down to the inverse of the largest eigenvalue, often by many orders of magnitude relative to the time scales of interest. Implicit methods (backward Euler, BDF, diagonally implicit RK) trade one nonlinear solve per step for unconditional stability. Exponential integrators sidestep the trade-off by treating the linear part exactly. And symplectic integrators abandon the optimality-in-error criterion for a different invariant — they preserve the geometric structure of Hamiltonian systems, giving bounded long-horizon energy error in regimes where order-optimized methods drift unboundedly.

Reading time: ~50 minutes prose; 90+ minutes with the JAX and Julia companions.

Direct-transfer hook: this chapter is the home of the C1 niche pilot, which asks whether selective state-space models can be analyzed as discretizations of Hamiltonian flows on the hidden-state manifold. The vortex-dynamics community has worked out the relevant symplectic-integration machinery in detail; transferring that machinery from continuum mechanics to ML hidden-state dynamics is the project. The chapter develops the necessary vocabulary — Hamiltonian, symplectic 2-form, modified Hamiltonian, energy drift — at a level that connects to the broader SSM landscape without assuming prior exposure.

6.1 Why explicit methods fail: stiffness

Chapter 5 ended with the Dahlquist barrier: no explicit Runge–Kutta method is A-stable. The barrier matters because of a phenomenon called stiffness — when the dynamics matrix $\statemat$ has eigenvalues with very different magnitudes.

A linear ODE $\dot \statevec = \statemat \statevec$ is stiff if $\statemat$ has at least one eigenvalue with $\operatorname{Re}(\lambda_i) \le -1/T$ for a time-scale $T$ much smaller than the simulation horizon, and the eigenvalue is paired with a long-lived (small- $\abs{\lambda}$ ) mode that one actually wants to track. Stiffness is therefore relative: a system is stiff with respect to a given horizon and given accuracy goal, not in absolute terms. The Robertson chemical-kinetics problem ( $\lambda$ ‘s spanning ten orders of magnitude) is the canonical textbook example; van der Pol at parameter $\mu \gg 1$ is a popular ML-adjacent stand-in because it generates a nonlinear oscillation whose slow and fast phases differ by $\mu$ .

For an explicit method, the step size is bounded by the fastest eigenvalue: $\stepsize \lesssim 2/\abs{\lambda_{\max}}$ (forward Euler) or some other small constant times $1/\abs{\lambda_{\max}}$ (RK4). If the slow eigenvalue we want to track is $\abs{\lambda_{\min}}$ , the integrator takes $\abs{\lambda_{\max}}/\abs{\lambda_{\min}}$ times more steps than the slow time scale would suggest. For Robertson this ratio is $10^{10}$ . The simulation never finishes.

For an SSM, stiffness arises naturally for input-dependent dynamics. Mamba’s selective scan generates an effective $\statemat(u_t)$ at each token; for inputs that produce eigenvalues with very different magnitudes, the time-discretized recurrence is solving a stiff problem at each step. If the discretization is not A-stable, the recurrence either blows up or requires a step size $\stepsize \ll 1$ — at which point you are running the model in an absurd regime. This is the structural reason Mamba-3 switched to exponential-trapezoidal (§4.5), which is A-stable in $\statemat$ but explicit in $u$ .

Two panels of position q(t) for the van der Pol oscillator at mu=10 over t in [0,50] at step sizes 0.005, 0.05, 0.2. Left (classical RK4, explicit): the two fine steps trace the relaxation cycle while the coarsest step diverges at once. Right (backward Euler, implicit): all three step sizes stay bounded. — Van der Pol oscillator ($\mu = 10$, initial state $(q,p)=(2,0)$) integrated to $t=50$ ($\sim$2.5 relaxation periods) at step sizes $\stepsize \in \{0.005, 0.05, 0.2\}$, showing position $q(t)$. **Left — classical RK4 (explicit):** $\stepsize = 0.005$ and $0.05$ resolve the relaxation oscillation, but at $\stepsize = 0.2$ RK4 diverges almost immediately (the trace truncates where $\abs{\text{state}} > 10^{8}$ is masked to NaN) — the explicit step-size bound $\stepsize \lesssim 1/\abs{\lambda_{\max}}$ of this section, made visible. **Right — backward Euler (implicit, L-stable, §6.2):** every $\stepsize$ stays on the bounded cycle; at the coarse $\stepsize = 0.2$ it stays stable but resolves the cycle only crudely — accuracy degrades, stability does not. Produced by \`companions/ch06/jax/stiff_demo.py\`; the divergence-vs-boundedness contrast is pinned by \`test_rk4_blows_up_at_coarse_dt\` and \`test_be_stable_at_coarse_dt\` in \`companions/ch06/jax/tests/test_stiff.py\`.

6.2 Implicit methods: backward Euler, BDF, DIRK

The fix for stiffness is to abandon explicit methods. Backward Euler is the simplest implicit method:

\statevec_{k+1} = \statevec_k + \stepsize f(\statevec_{k+1}).

Solving for $\statevec_{k+1}$ requires either a linear solve (for linear $f$ ) or a Newton iteration (for nonlinear $f$ ). The pay-off is the stability function

\stabfn_{\text{BE}}(z) = \frac{1}{1 - z},

which has $\abs{\stabfn_{\text{BE}}(z)} \le 1$ for every $z$ with $\operatorname{Re}(z) \le 0$ — backward Euler is A-stable. It is also L-stable: $\stabfn_{\text{BE}}(z) \to 0$ as $\operatorname{Re}(z) \to -\infty$ , so fast modes get damped aggressively as desired on stiff problems.

Proposition 6.1.

(Backward Euler is L-stable.) Backward Euler’s stability function $\stabfn(z) = 1/(1-z)$ satisfies $\abs{\stabfn(z)} \le 1$ for every $z \in \C$ with $\operatorname{Re}(z) \le 0$ , and $\stabfn(z) \to 0$ as $\operatorname{Re}(z) \to -\infty$ .

The proof is one line each: $\abs{1 - z}^2 = (1 - \operatorname{Re}(z))^2 + \operatorname{Im}(z)^2 \ge 1$ when $\operatorname{Re}(z) \le 0$ , giving $\abs{\stabfn(z)} \le 1$ . The limit is immediate.

Backward Euler is first-order accurate. Higher-order implicit methods come in several families.

BDF $k$ (Backward Differentiation Formula, $k$ th-order) is a multi-step method that approximates $\dot \statevec(t_{k+1})$ by a $k$ -point backward difference and equates it to $f(\statevec_{k+1})$ . BDF1 = backward Euler; BDF2 is second-order and L-stable; BDF $k$ for $k = 3, \ldots, 6$ are conditionally stable but widely used in stiff ODE codes. BDF7 and higher are not zero-stable (the characteristic root condition fails), so they are unusable. The order limit is the second Dahlquist barrier, a structural fact about multi-step methods that we will state without proof: no $A$ -stable linear multi-step method has order higher than 2.

Diagonally implicit RK (DIRK) sits between explicit RK and fully implicit RK. The $A$ matrix is lower triangular with non-zero diagonal, so each stage requires solving an independent nonlinear system (cheaper than the coupled system of fully implicit RK, but more expensive than the trivial explicit evaluations). Singly-DIRK (SDIRK) further constrains all diagonal entries to be equal, so the Newton Jacobian factorizes once per step. The Crank–Nicolson (trapezoidal) scheme is a 2-stage ESDIRK of order 2 — its explicit first stage and unequal diagonals place it just outside the strict SDIRK class.

Fully implicit RK — the most expensive family — includes Gauss–Legendre methods, which we will return to in §6.5 as the optimal symplectic methods.

The computational trade-off for implicit methods: each step costs $O(N^3)$ (linear solve for the Jacobian factor) or $O(\text{iterations} \cdot N^2)$ (Newton iteration for nonlinear $f$ ). For SSMs with $N = 16, 64$ this is acceptable; for $N$ in the thousands it begins to dominate. The Chapter 4 exponential-trapezoidal scheme has the same A-stability as backward Euler but only costs one matrix-exponential per timestep (vs one Newton solve), and the matrix exponential of a low-rank or diagonalized $\statemat$ is cheap. Mamba-3’s specific tactical advantage over a backward-Euler implementation comes down to this cost asymmetry.

6.3 Exponential integrators revisited

The exp-trapezoidal scheme of §4.5 is the simplest member of the exponential integrator family. Recall the general idea: given $\dot \statevec = \statemat \statevec + \inputmat u(t)$ , treat the linear part exactly via $e^{\statemat \stepsize}$ and approximate only the forcing integral. The $\varphi$ -function family $\varphi_k(z) = (e^z - \sum_{j=0}^{k-1} z^j/j!)/z^k$ provides building blocks for arbitrarily high order.

Three commonly used schemes:

exp-Euler: $\statevec_{k+1} = e^{\statemat \stepsize} \statevec_k + \stepsize \varphi_1(\statemat \stepsize) \inputmat u_k$ . Treats the input as piecewise constant (like ZOH); first-order for forced systems, exact on autonomous problems.
exp-midpoint: uses the input midpoint instead of $u_k$ ; second-order for symmetric input perturbations.
exp-trapezoidal (§4.5): linear interpolation of the input; second-order for $C^1$ inputs.
ETDRK4 Hochbruck & Ostermann (2010) : a four-stage exponential RK method that achieves fourth order. The Mamba-3 paper Lahoti et al. (2026) does not (yet) use ETDRK4; its second-order exp-trapezoidal is the empirical sweet spot.

All exponential integrators inherit A-stability from $e^{\statemat \stepsize}$ . They sidestep the Dahlquist barrier because their stability function is not a polynomial — it is $e^z$ times a rational-in- $z$ correction factor — and the polynomial constraint of explicit RK methods does not apply.

The implementation cost: one matrix exponential plus one or two $\varphi$ -function evaluations per step. For a low-rank $\statemat$ (which selective SSMs typically have, since the structured- $\statemat$ pattern of Chapter 7 carries forward), the matrix exponential is cheap. For dense general $\statemat$ exponentials cost $O(N^3)$ via Padé / scaling-and-squaring, the same as a Newton solve — so the choice between exp-trap and backward Euler often comes down to whether you want autonomous-exactness (exp-trap wins) or L-stability damping of high-frequency modes (backward Euler wins).

6.4 Hamiltonian mechanics, energy conservation, and the symplectic structure

The remainder of this chapter introduces the language of Hamiltonian systems and symplectic integration. This is where the direct-transfer hook from continuum mechanics applies most cleanly. The exposition is brief and standalone; for the full theory see Hairer–Lubich–Wanner Hairer et al. (2006) , Chapter VI.

A Hamiltonian system on phase space $(q, p) \in \R^n \times \R^n$ is governed by a scalar function $\hamilton(q, p)$ — the Hamiltonian, intended to represent total energy — through Hamilton’s equations:

\dot q = \frac{\partial \hamilton}{\partial p}, \qquad \dot p = -\frac{\partial \hamilton}{\partial q}.

The canonical example is the harmonic oscillator $\hamilton(q, p) = \tfrac{1}{2}(p^2 + \omega^2 q^2)$ , giving $\dot q = p$ , $\dot p = -\omega^2 q$ . The pendulum is $\hamilton(q, p) = \tfrac{1}{2} p^2 - \cos(q)$ ; vortex-dynamics flows are higher-dimensional but structurally identical.

Three properties make Hamiltonian systems geometrically special.

Energy conservation. For a $C^1$ Hamiltonian, along any trajectory $(q(t), p(t))$ , $\frac{d}{dt} \hamilton(q(t), p(t)) = \frac{\partial \hamilton}{\partial q} \dot q + \frac{\partial \hamilton}{\partial p} \dot p = \frac{\partial \hamilton}{\partial q} \frac{\partial \hamilton}{\partial p} - \frac{\partial \hamilton}{\partial p} \frac{\partial \hamilton}{\partial q} = 0$ . The Hamiltonian is a constant of motion.

Phase-space volume preservation (Liouville’s theorem). The flow $\Phi_t : (q_0, p_0) \mapsto (q(t), p(t))$ has Jacobian determinant 1 everywhere: $\det \nabla \Phi_t \equiv 1$ . Volumes in phase space are preserved exactly under the dynamics.

Symplectic structure. Define the symplectic 2-form $\symform = \sum_i dq_i \wedge dp_i$ . The flow $\Phi_t$ is a symplectic transformation: it preserves $\symform$ in the sense that $\Phi_t^* \symform = \symform$ where $\Phi_t^*$ denotes the pullback. Symplecticity is a strict refinement of volume preservation — every symplectic transformation is volume-preserving, but not vice versa.

The reason these properties matter for numerics is that standard order-optimized integrators destroy them. A fourth-order RK method simulates the harmonic oscillator with $O(\stepsize^4)$ global error in the state — but it does not exactly preserve energy. The energy error accumulates monotonically (one-sided, not oscillating around the initial value), producing a linear-in-time energy drift whose rate depends on the problem nonlinearity and step size.

On the linear harmonic oscillator at small $\stepsize$ the rate is impressively small: RK4’s stability function $R(z) = \sum_{j=0}^{4} z^j/j!$ is not time-symmetric (a symmetric method satisfies $R(z)R(-z) = 1$ ; RK4 instead gives $1 + z^6/72 + \cdots$ ), so on the imaginary axis $\abs{R(i\theta)}^2 = 1 - \theta^6/72 + \cdots < 1$ and the energy is very slightly, monotonically damped — the slow negative drift the symplectic_demo.py companion exhibits. It remains monotonic in a way symplectic methods are not.

As $\stepsize$ grows or the problem becomes nonlinear the rate becomes visible: at $\stepsize \approx 0.3$ over 1000 periods of the pendulum, RK4 drifts by $\sim 10^{-2}$ in energy and noticeably distorts the orbit; at large enough $\stepsize$ the orbit eventually escapes the bound regime. On a Kepler problem the analogous failure mode is eventual ejection or capture of the orbiting body. These are not numerical instabilities in the Chapter 5 sense — the method is A-stable for the small-amplitude regime — they are geometric errors: the wrong qualitative dynamics, made visible at sufficient horizon.

6.5 Symplectic integrators

A symplectic integrator is a numerical map $\Psi_\stepsize : (q, p) \mapsto (q', p')$ that preserves the symplectic 2-form: $\Psi_\stepsize^* \symform = \symform$ . Symplectic integrators do not, in general, preserve $\hamilton$ exactly — but they preserve a nearby modified Hamiltonian $\widetilde \hamilton = \hamilton + O(\stepsize^p)$ for a $p$ -th order symplectic method. As a consequence, the energy $\hamilton(q_k, p_k)$ along the discrete trajectory oscillates within a bounded band of width $O(\stepsize^p)$ around its initial value — not drifting linearly with $t$ , but staying near a level set of $\widetilde \hamilton$ . This is the backward error analysis of geometric integrators, and it is the practical reason symplectic methods are mandatory for long-horizon Hamiltonian simulations.

Three canonical schemes.

Symplectic Euler (1st order). For a separable Hamiltonian $\hamilton(q, p) = T(p) + V(q)$ :

p_{k+1} = p_k - \stepsize \, V'(q_k), \qquad q_{k+1} = q_k + \stepsize \, T'(p_{k+1}).

Update momentum first using position-derived force; then update position using the updated momentum. The asymmetry — using $p_{k+1}$ in the second step rather than $p_k$ — is what makes the scheme symplectic. A symmetric version updates $q$ first, then $p$ ; both are valid.

Störmer–Verlet (2nd order, symmetric). A symmetric version of symplectic Euler that does a half-step in momentum, full step in position, half-step in momentum:

\begin{aligned} p_{k+1/2} &= p_k - \tfrac{\stepsize}{2} V'(q_k), \\ q_{k+1} &= q_k + \stepsize \, T'(p_{k+1/2}), \\ p_{k+1} &= p_{k+1/2} - \tfrac{\stepsize}{2} V'(q_{k+1}). \end{aligned}

This is the algorithm used in essentially every molecular-dynamics simulation. Second-order accurate, symplectic, time-reversible (the same algorithm run with $-\stepsize$ exactly reverses the trajectory). The energy error is bounded by $O(\stepsize^2)$ over arbitrarily long horizons.

Gauss–Legendre IRK (order $2s$ , A-stable, symplectic). The $s$ -stage Gauss–Legendre Runge–Kutta method has nodes at the Gauss–Legendre quadrature points and is the unique $s$ -stage implicit RK method of order $2s$ (the maximum attainable for $s$ stages). The 1-stage method is the implicit midpoint rule (order 2, symplectic, A-stable). The 2-stage method is order 4. Gauss–Legendre IRK is the unique family that is simultaneously A-stable, symplectic, and of optimal order — making it the gold standard for high-accuracy long-horizon Hamiltonian simulation Hairer et al. (2006) .

Energy error vs time on the harmonic oscillator at Δ=0.3 over 100 periods, comparing classical RK4 against Störmer–Verlet. — Energy trajectory $E(t) - E_0$ on the harmonic oscillator $\\dot q = p$, $\\dot p = -q$ over 100 periods at $\\Delta = 0.3$. Both methods stay close to the initial energy $E_0 = 0.5$ at this horizon, but their character differs: Classical RK4 (gold) accumulates *monotonic* linear-in-time drift (~$10^{-2}$ at this $\\Delta$ over 100 periods, extrapolating linearly toward ~$10^{-1}$ by 1000 periods), while Störmer–Verlet (navy) *oscillates* within a *bounded* band of width $\\sim \\Delta^2/8 \\approx 10^{-2}$ that is constant regardless of horizon. At smaller $\\Delta$ (e.g., $\\Delta = 0.05$) RK4's drift becomes much smaller in absolute terms (~$10^{-6}$ at 100 periods) — but it remains monotonic and horizon-linear, while Verlet's band remains horizon-bounded. The defining qualitative property of symplectic vs non-symplectic integrators is the *constancy* of the symplectic band under horizon scaling, not the magnitude at any specific $(\\Delta, T)$. Produced by `companions/ch06/jax/symplectic_demo.py` (which prints the $\\Delta = 0.3$, 100-period drift); the qualitative monotonic-vs-bounded contrast is pinned in `companions/ch06/julia/runtests.jl`.

Theorem 6.2.

(Modified Hamiltonian for symplectic methods.) Let $\Psi_\stepsize$ be a symplectic integrator of order $p$ applied to a system with analytic Hamiltonian $\hamilton$ , and let the step size satisfy $\stepsize \le \stepsize_0$ for a problem-dependent threshold $\stepsize_0$ . Then there is a modified Hamiltonian $\widetilde \hamilton = \hamilton + \stepsize^p H_p + \stepsize^{p+1} H_{p+1} + \cdots$ whose optimally truncated partial sum $\widetilde \hamilton_N$ is preserved along the discrete trajectory up to an exponentially small error $O(e^{-c/\stepsize})$ over exponentially long times $O(e^{c/\stepsize})$ . The original $\hamilton$ therefore oscillates within a bounded $O(\stepsize^p)$ band along the discrete trajectory.

The proof — Hairer–Lubich–Wanner Chapter IX — uses backward error analysis to expand the modified Hamiltonian as an asymptotic series in $\stepsize$ and shows the series can be truncated with controlled remainder. The takeaway for practice: a $p$ -th order symplectic integrator is exponentially better than a $p$ -th order non-symplectic one on Hamiltonian problems, even though the local truncation error of both is $O(\stepsize^{p+1})$ . Local error is misleading for long-horizon problems; geometric structure is what matters.

6.6 Toward selective SSMs as Hamiltonian flows

The C1 niche pilot proposes the following research program. The hidden-state dynamics of a selective SSM can sometimes be written as a Hamiltonian flow on the hidden-state manifold; when they can, applying a symplectic integrator should reduce long-horizon error in a way that order-optimized integrators cannot match. The first step is to characterize when this Hamiltonian structure exists.

For a linear-time-invariant SSM $\dot \statevec = \statemat \statevec$ , the dynamics are Hamiltonian iff there exists a symmetric positive-definite $P$ such that $\statemat^\top P + P \statemat = 0$ . Equivalently, $\statemat$ is similar (via $P^{1/2}$ ) to a skew-symmetric matrix. The quadratic Hamiltonian is then $\hamilton(\statevec) = \tfrac{1}{2} \statevec^\top P \statevec$ , and the symplectic 2-form is $\symform = P^{-1}$ (in the standard symplectic-flat coordinates). For HiPPO-LegS the eigenvalues are all real-negative, not purely imaginary, so HiPPO-LegS dynamics are dissipative, not Hamiltonian. Symplectic methods do not apply. This is consistent with the empirical observation that ZOH and bilinear discretizations of HiPPO-LegS perform well — dissipation handles the long-horizon stability story.

For a selective SSM with input-dependent $\statemat(u)$ , the relevant question becomes: does $\statemat(u)$ have a Hamiltonian decomposition for some range of inputs, even if not for all? The C1 pilot’s empirical hypothesis is that during certain training phases the learned $\statemat(u)$ matrices drift toward eigenvalue spectra that are closer to purely imaginary than to the LHP, and that this drift is associated with the long-horizon recall artifacts observed in Mamba’s failure modes Halloran et al. (2025) . If this hypothesis holds, then symplectic discretization of selective SSMs should reduce those artifacts. The Chapter 17 pilot integration develops this thread; this chapter has supplied the necessary integration-theory vocabulary.

Phase portrait of the pendulum: both Störmer–Verlet and RK4 trace nearly identical closed orbits at this short horizon and small step. — Pendulum $\\hamilton = p^2/2 - \\cos q$ phase portrait over 50 periods at $\\Delta = 0.1$. Both Störmer–Verlet (navy) and Classical RK4 (gold) trace visually indistinguishable closed orbits at this $(\\Delta, T)$: each method's energy drift (Verlet: bounded oscillation $\\sim \\Delta^2$; RK4: slow monotonic drift) is below visual resolution at this $(\\Delta, T)$ against an orbit of radius 1. The qualitative geometric distinction — that Verlet's energy band is horizon-bounded while RK4's grows linearly — surfaces at longer horizons or larger $\\Delta$: at $\\Delta = 0.3$ over 1000 periods RK4 distorts noticeably while Verlet remains tight. The phase-portrait visualization here illustrates that *both* methods are formally stable in the Chapter 5 sense; the symplectic-vs-non-symplectic difference is a long-horizon property, not a short-horizon one. Produced by `companions/ch06/jax/symplectic_demo.py`.

6.7 What’s next

This chapter completes the foundations layer of the book (Part I). The next part — Chapters 7–10 — introduces structured SSMs (S4, S4D, S5, Mamba-1/2, Mamba-3) as discretizations of the continuous systems developed here. Chapter 7 sets up HiPPO theory; Chapter 8 derives the S4 / S4D / S5 architectures (and sorts out which discretization each paper actually used); Chapter 9 introduces selective scans (Mamba-1/2); Chapter 10 brings the exp-trapezoidal scheme of §4.5 to its home in Mamba-3.

The C1 pilot integration in Chapter 17 returns to this chapter’s symplectic story, applying the modified-Hamiltonian framework to learned selective dynamics. Readers focused on the SSM applications can skip to Chapter 7 now and consult §6.5–§6.6 when the symplectic discussion comes back.

6.8 Exercises

Six problems. Short/numerical (6.1–6.3) have inline collapsible solutions; long/proof exercises (6.4–6.6) have full worked solutions in §6.9.

Exercise 6.1 (computation)

Verify that backward Euler’s stability function is $\stabfn_{\text{BE}}(z) = 1/(1-z)$ by directly applying the scheme to the test problem $\dot y = \lambda y$ and solving for $y_{k+1}$ .

Solution

Backward Euler: $y_{k+1} = y_k + \stepsize \lambda y_{k+1}$ . Rearranging, $(1 - \stepsize \lambda) y_{k+1} = y_k$ , so $y_{k+1} = y_k / (1 - \stepsize \lambda)$ . Setting $z = \stepsize \lambda$ , the stability function is $\stabfn_{\text{BE}}(z) = 1/(1 - z)$ . ∎

Exercise 6.2 (computation)

For the harmonic oscillator $\dot q = p$ , $\dot p = -q$ (so $\hamilton = (p^2 + q^2)/2$ ), compute one step of symplectic Euler (“p first” variant) starting from $(q_0, p_0) = (1, 0)$ at $\stepsize = 0.1$ . What are $(q_1, p_1)$ , and what is $\hamilton(q_1, p_1)$ ?

Solution

Symplectic Euler ( $p$ first): $p_{k+1} = p_k - \stepsize \cdot q_k$ , then $q_{k+1} = q_k + \stepsize \cdot p_{k+1}$ .

Starting from $(1, 0)$ : $p_1 = 0 - 0.1 \cdot 1 = -0.1$ . Then $q_1 = 1 + 0.1 \cdot (-0.1) = 0.99$ . So $(q_1, p_1) = (0.99, -0.1)$ .

Energy: $\hamilton(q_1, p_1) = (0.99^2 + 0.1^2)/2 = (0.9801 + 0.01)/2 = 0.49505$ . Initial energy was $0.5$ . The energy is not exactly preserved (it changed by $-0.00495 \approx -\stepsize^2/2$ ), but it is bounded — over many steps it will oscillate, not drift.

Exercise 6.3 (computation + code)

Run companions/ch06/jax/symplectic_demo.py. Verify that classical RK4 produces linear-in-time energy drift on the harmonic oscillator while Störmer–Verlet’s energy stays in a bounded band. What is the drift rate (energy per period) of RK4 at $\stepsize = 0.05$ ?

Solution

The companion’s main output shows two energy-vs-time traces over 100 periods. RK4 drifts approximately linearly; the drift rate at $\stepsize = 0.05$ is roughly $1.4 \times 10^{-8}$ per period — small in absolute terms but growing linearly with horizon (so ~ $1.4 \times 10^{-6}$ after 100 periods, ~ $1.4 \times 10^{-5}$ after 1000). The linear-time growth rate, not the absolute magnitude, is the diagnostic. Störmer–Verlet’s energy stays within a bounded band of width $\sim \stepsize^2/8 \approx 3 \times 10^{-4}$ , with no secular drift no matter how long the run. The contrast — monotonic linear-in-time growth vs bounded oscillation — is the key takeaway of §6.5, and is the horizon-invariance of the symplectic band rather than the absolute magnitude at any specific $(\stepsize, T)$ that defines a symplectic method.

Exercise 6.4 (theory) — solution in §6.9

Prove the L-stability of backward Euler (Proposition 6.1): $\abs{\stabfn_{\text{BE}}(z)} \le 1$ for $\operatorname{Re}(z) \le 0$ , and $\stabfn_{\text{BE}}(z) \to 0$ as $\operatorname{Re}(z) \to -\infty$ .

Exercise 6.5 (theory) — solution in §6.9

Show that symplectic Euler is symplectic. That is, prove that the Jacobian $\partial(q_{k+1}, p_{k+1})/\partial(q_k, p_k)$ has determinant exactly 1 for the harmonic oscillator. (Then note: for separable Hamiltonians, the same argument applies via the chain rule.)

Exercise 6.6 (theory) — solution in §6.9

Show that for a linear-time-invariant ODE $\dot \statevec = \statemat \statevec$ with $\statemat$ skew-symmetric ( $\statemat^\top = -\statemat$ ), the function $\hamilton(\statevec) = \tfrac{1}{2} \statevec^\top \statevec$ is a constant of motion. Use this to give a sufficient condition (in terms of $\statemat$ only) for an SSM to be Hamiltonian.

6.9 Full solutions to theory exercises

Solution to Exercise 6.4

Write $z = -r + i\omega$ with $r \ge 0$ . Then $\abs{1 - z}^2 = (1 + r)^2 + \omega^2$ . Since $r \ge 0$ , $(1 + r)^2 \ge 1$ , so $\abs{1 - z}^2 \ge 1$ . Therefore $\abs{\stabfn_{\text{BE}}(z)} = 1/\abs{1 - z} \le 1$ . ∎

For the L-stability limit: as $r \to \infty$ with $\omega$ bounded, $\abs{1 - z}^2 = (1 + r)^2 + \omega^2 \to \infty$ , so $\stabfn_{\text{BE}}(z) \to 0$ . ∎

Solution to Exercise 6.5

For the harmonic oscillator $\dot q = p$ , $\dot p = -q$ , symplectic Euler ( $p$ first) gives

\begin{aligned} p_{k+1} &= p_k - \stepsize \, q_k, \\ q_{k+1} &= q_k + \stepsize \, p_{k+1} = q_k + \stepsize (p_k - \stepsize \, q_k) = (1 - \stepsize^2) q_k + \stepsize p_k. \end{aligned}

The Jacobian is

J = \frac{\partial(q_{k+1}, p_{k+1})}{\partial(q_k, p_k)} = \begin{pmatrix} 1 - \stepsize^2 & \stepsize \\ -\stepsize & 1 \end{pmatrix}.

The determinant: $\det J = (1 - \stepsize^2)(1) - \stepsize \cdot (-\stepsize) = 1 - \stepsize^2 + \stepsize^2 = 1$ . ∎

So the discrete map preserves the symplectic 2-form $dq \wedge dp$ exactly. This is the algebraic content of “symplectic Euler is symplectic” — the determinant being exactly $1$ , not $1 + O(\stepsize^p)$ .

For a general separable Hamiltonian $\hamilton = T(p) + V(q)$ , symplectic Euler gives $p_{k+1} = p_k - \stepsize V'(q_k)$ and $q_{k+1} = q_k + \stepsize T'(p_{k+1})$ . The Jacobian factorizes as

J = \begin{pmatrix} 1 & 0 \\ -\stepsize V''(q_k) & 1 \end{pmatrix} \cdot \begin{pmatrix} 1 & \stepsize T''(p_{k+1}) \\ 0 & 1 \end{pmatrix},

each of which has determinant 1 (lower- and upper-triangular with 1s on the diagonal). The product determinant is therefore 1. ∎

This “shear-decomposition” argument generalizes to all symplectic integrators that update $q$ and $p$ in alternating steps: each individual sub-step is a shear, with unit determinant. Verlet, leapfrog, Yoshida composition methods — all are products of unit-determinant shears.

Solution to Exercise 6.6

For $\dot \statevec = \statemat \statevec$ with $\statemat^\top = -\statemat$ , compute

\frac{d}{dt} \hamilton(\statevec) = \frac{d}{dt} \tfrac{1}{2} \statevec^\top \statevec = \tfrac{1}{2} (\dot \statevec^\top \statevec + \statevec^\top \dot \statevec) = \tfrac{1}{2} (\statevec^\top \statemat^\top \statevec + \statevec^\top \statemat \statevec) = \tfrac{1}{2} \statevec^\top (\statemat^\top + \statemat) \statevec = 0,

since $\statemat^\top + \statemat = 0$ by skew-symmetry. So $\hamilton(\statevec(t)) \equiv \hamilton(\statevec_0)$ — the squared norm is exactly preserved. ∎

Sufficient condition for an SSM to be Hamiltonian. Any SSM whose continuous dynamics matrix $\statemat$ is skew-symmetric is Hamiltonian with Hamiltonian $\hamilton(\statevec) = \tfrac{1}{2} \statevec^\top \statevec$ . The eigenvalues of a real skew-symmetric matrix are purely imaginary (or zero), so this regime is exactly the purely oscillatory corner of the SSM landscape — the opposite of the LHP-stable corner where HiPPO and S4 live. A more general sufficient condition: $\statemat$ has a Hamiltonian structure (in the matrix-Lie-algebra sense) — equivalently, there exists a symmetric positive-definite $P$ with $\statemat^\top P + P \statemat = 0$ . The C1 pilot’s empirical project is to identify selective-SSM regimes where this condition holds approximately, and to apply symplectic discretization there.

6.10 Companion code

Two JAX companions, two PyTorch companions, and two Julia companions for Chapter 6.

JAX (companions/ch06/jax/):

stiff_demo.py — van der Pol oscillator at $\mu = 10$ (mildly stiff); compares explicit RK4 (blows up at large $\stepsize$ ) against backward Euler (stable at every $\stepsize$ ); emits stiff_blowup.png.
symplectic_demo.py — harmonic oscillator + pendulum; runs classical RK4 vs Störmer–Verlet over 100 periods; emits energy_drift.png and phase_portrait.png.

Julia (companions/ch06/julia/):

implicit_methods.jl — backward Euler, BDF2, and DIRK from-scratch implementations on a stiff test problem; uses only LinearAlgebra + Printf.
symplectic_methods.jl — symplectic Euler, Störmer–Verlet, and the 2-stage Gauss–Legendre IRK method on a Hamiltonian test; emits empirical order-of-accuracy table.
Project.toml / Manifest.toml.

PyTorch (companions/ch06/torch/):

stiff_demo.py — the van der Pol RHS/Jacobian with RK4 and backward-Euler steppers (compute-only; the JAX companion produces the figure).
symplectic_demo.py — the RK4 and Störmer–Verlet steppers on the harmonic oscillator and pendulum.
tests/ — cross-framework parity: the torch trajectories and energy-drift quantities equal their JAX counterparts to within $10^{-9}$ (float64).

To run from the repo root:

# JAX
PYTHONPATH=. python companions/ch06/jax/stiff_demo.py
PYTHONPATH=. python companions/ch06/jax/symplectic_demo.py

# PyTorch (needs the .venv [torch] extra; parity only, no figures)
PYTHONPATH=. python companions/ch06/torch/stiff_demo.py
PYTHONPATH=. python companions/ch06/torch/symplectic_demo.py

# Julia (lightweight — no DifferentialEquations.jl dependency)
julia --project=companions/ch06/julia companions/ch06/julia/implicit_methods.jl
julia --project=companions/ch06/julia companions/ch06/julia/symplectic_methods.jl

All figures emit to public/figures/ch06/.

Part II · SSM Core Week 7 Published

HiPPO theory: orthogonal-basis projection operators

Online function approximation by projection onto orthogonal polynomial bases — the HiPPO-LegS operator, its lower-triangular structure and exact spectrum, discretization to the S4 recurrence, and the basis-conditioning question.

HiPPO theory: orthogonal-basis projection operators

Chapter 7 — at a glance

Goal: answer the question Chapter 4 left open — what should the matrix $\statemat$ be? The HiPPO framework Gu et al. (2020) derives a structured $\statemat$ from a single principle: the state should hold the coefficients of the best $N$ -term approximation of the input history $u(s),\, s\in[0,t]$ , in an orthogonal polynomial basis. We build the HiPPO-LegS operator from that principle, prove it is lower-triangular with spectrum exactly $\{-1,\dots,-N\}$ , discretize it with Chapter 4’s zero-order hold to recover an S4-style recurrence, and confront the basis-conditioning issue that drives one of this book’s research niches.

Reading time: ~45 minutes prose; 90+ minutes with the JAX, Julia, and PyTorch companions.

Direct-transfer hook (C1 pilot): the HiPPO-LegS matrix is the structured $\statemat$ whose discretization the symplectic-integrator pilot scrutinizes. Its exact eigenvalue structure (§7.7) and the conditioning of its basis (§7.5) are precisely the objects a geometric-integration analysis of S4 / Mamba begins from.

Key insight: HiPPO turns “memory” into “online orthogonal projection.” Random $\statemat$ gives vanishing or exploding dynamics; the HiPPO $\statemat$ is the projection-update operator of a Legendre expansion, and its structure (lower-triangular, real spectrum on the negative integers) is exactly what makes long-range memory both expressive and numerically stable.

7.1 The online function-approximation problem

A recurrent model carries a finite state $\statevec_t \in \R^N$ and updates it one sample at a time. The central question is what should that state store? HiPPO’s answer is sharp: the state should store a compressed representation of the entire input history, chosen so that the history can be reconstructed as accurately as $N$ numbers allow.

Make this precise. Fix a time $t > 0$ and consider the input seen so far, $u(s)$ for $s \in [0, t]$ . Equip the interval with a measure $\omega_t(s)$ (a weighting that says which parts of the history matter). The best $N$ -term approximation of $u$ in the weighted space $L^2(\omega_t)$ is its orthogonal projection onto an $N$ -dimensional subspace spanned by basis functions $g_0, \dots, g_{N-1}$ :

u(s) \;\approx\; \sum_{n=0}^{N-1} c_n(t)\, g_n(s), \qquad c_n(t) = \langle u, g_n \rangle_{\omega_t} = \int_0^t u(s)\, g_n(s)\, \omega_t(s)\, ds.

The coefficient vector $c(t) = (c_0(t), \dots, c_{N-1}(t))^\top$ is the compressed memory. HiPPO’s discovery is that for well-chosen bases $c(t)$ obeys a linear ODE in $t$ — so maintaining the optimal projection online is exactly running a linear state-space model.

The picture to keep in mind is reconstruction. Drive a signal into the operator, read off $c(t)$ at the final time, and expand $\sum_n c_n g_n$ : you recover the history, and the recovery sharpens as $N$ grows.

Left: a two-sinusoid history and its HiPPO-LegS reconstructions at N=4, 16, 64, converging to the true signal. Right: relative L2 reconstruction error versus N on a log scale, dropping sharply then plateauing. — HiPPO-LegS as online function approximation. Left: the input history (dark) and its reconstruction from the HiPPO state at N=4, 16, 64. Right: relative $L^2$ error falls from $0.80$ at $N=4$ to $0.011$ at $N=16$, then plateaus near $7\times10^{-3}$ once the state dimension exceeds the signal's intrinsic complexity. Produced by companions/ch07/jax/hippo_reconstruction.py.

7.2 Orthogonal polynomial bases

The natural basis for functions on an interval is the Legendre polynomials $P_n$ . On $[-1, 1]$ they are orthogonal under the uniform measure,

\int_{-1}^{1} P_n(x)\, P_m(x)\, dx = \frac{2}{2n+1}\,\delta_{nm},

and satisfy the three-term recurrence $(n+1)P_{n+1}(x) = (2n+1)\,x\,P_n(x) - n\,P_{n-1}(x)$ with $P_0 = 1$ , $P_1(x) = x$ . Rescaling to the unit interval $z \in [0,1]$ and normalizing gives an orthonormal family

\tilde P_n(z) = \sqrt{2n+1}\,P_n(2z - 1), \qquad \int_0^1 \tilde P_n(z)\, \tilde P_m(z)\, dz = \delta_{nm}.

These are the basis functions HiPPO projects onto; the companion’s Gram-matrix check confirms their orthonormality to $10^{-3}$ under trapezoidal quadrature.

The first five normalized shifted Legendre basis functions on the unit interval, increasing in oscillation with degree. — The normalized shifted Legendre basis $\tilde P_n(z) = \sqrt{2n+1}\,P_n(2z-1)$ for $n = 0,\dots,4$ on $z \in [0,1]$. Higher-degree polynomials oscillate more and resolve finer detail of the history. Produced by companions/ch07/jax/hippo_reconstruction.py.

Orthogonality is what makes projection cheap and optimal. Because the basis is orthonormal, the best approximation is obtained coefficient-by-coefficient — no linear system to solve.

Theorem 7.1.

(Orthogonal projection is the best approximation.) Let $\{g_n\}_{n=0}^{N-1}$ be orthonormal in $L^2(\omega)$ and let $V = \operatorname{span}\{g_n\}$ . For any $u \in L^2(\omega)$ , the coefficients $c_n = \langle u, g_n\rangle_\omega$ minimize the reconstruction error: $\big\lVert u - \sum_n c_n g_n \big\rVert_\omega \le \big\lVert u - v \big\rVert_\omega$ for every $v \in V$ , with equality only when $v = \sum_n c_n g_n$ .

The proof (Exercise 7.5, §7.10) is the Hilbert-space Pythagorean theorem: the residual $u - \sum_n c_n g_n$ is orthogonal to $V$ , so any other $v \in V$ only adds an orthogonal error that grows the norm.

The measure $\omega_t$ encodes which history matters. Two choices dominate the HiPPO literature Gu et al. (2020) :

LegT (translated Legendre): a sliding window of fixed length $\theta$ — the state remembers the most recent $\theta$ time units uniformly and forgets everything older. This is a fixed-horizon memory.
LegS (scaled Legendre): the growing window $[0,t]$ with uniform weight, rescaled as $t$ increases. The state remembers all history, allocating its $N$ coefficients to the whole past at every scale. This is the variant that gives S4 its long-range reach, and the one we develop in detail.

7.3 The HiPPO-LegS operator

For the scaled measure the projection coefficients obey a strikingly clean ODE. Differentiating $c_n(t) = \langle u, \tilde P_n\rangle$ through the time-dependence of the measure and using the Legendre recurrence yields the HiPPO-LegS dynamics, a linear state-space system whose matrices have a closed form.

Proposition 7.2.

(HiPPO-LegS closed form.) The scaled-Legendre projection coefficients evolve under a linear system with state matrix $\statemat \in \R^{N\times N}$ and input matrix $\inputmat \in \R^{N\times 1}$ given by

\statemat_{ij} = -\begin{cases} \sqrt{(2i+1)(2j+1)} & i > j \\ \,i + 1 & i = j \\ \,0 & i < j \end{cases} \qquad \inputmat_{i} = \sqrt{2i+1},

for $i, j = 0, \dots, N-1$ . In particular $\statemat$ is lower-triangular.

The derivation (Exercise 7.4, §7.10) substitutes the Legendre three-term recurrence into the time-derivative of the projection integral; the lower-triangular structure is inherited from the recurrence, which expresses $P_{n+1}$ using only lower-degree polynomials. The matrix has a vivid signature: a negative diagonal that grows linearly, and negative off-diagonal entries below it that grow like $\sqrt{ij}$ .

Heatmap of the 8x8 HiPPO-LegS A matrix: zero above the diagonal, negative on and below it, with magnitudes growing toward the bottom-left. — The HiPPO-LegS state matrix $\statemat$ for $N=8$. Strictly upper-triangular entries are exactly zero; the diagonal is $-(i{+}1)$ and the sub-diagonal block grows in magnitude like $-\sqrt{(2i{+}1)(2j{+}1)}$. The lower-triangular structure is the algebraic fingerprint of the Legendre recurrence. Produced by companions/ch07/jax/hippo_matrix.py.

[Note] Two conventions, one matrix. The exact LegS operator is time-varying,

\ddt c(t) = \tfrac{1}{t}\!\left(-\statemat_{+}\, c + \inputmat\, u\right)

with the positive-eigenvalue matrix

\statemat_{+} = -\statemat

(this is what hippo_reconstruction.py integrates to produce Figure 7.1). The S4 lineage instead freezes a linear time-invariant surrogate

\ddt \statevec = \statemat\,\statevec + \inputmat\, u

with the stable

\statemat

above (spectrum in the left half-plane) and a learnable timescale

\stepsize

. Same structured matrix; the sign and the

1/t

scaling are the difference between the exact operator and the LTI model that gets discretized.

This is the answer to Chapter 4’s open question. Where Chapter 4 asked “given $\statemat$ , how do we discretize it?”, HiPPO supplies the $\statemat$ — not by guessing, but as the projection-update operator of an optimal polynomial memory. Random initialization of $\statemat$ produces eigenvalues scattered across the complex plane and dynamics that vanish or explode over long horizons; the HiPPO $\statemat$ is engineered so that its discretization gives stable, long-memory recurrences Gu et al. (2020) .

7.4 From the operator to the S4-style recurrence

To run the HiPPO operator on sampled data we discretize the LTI system $\ddt \statevec = \statemat \statevec + \inputmat u$ exactly as in Chapter 4 §4.3. The zero-order hold gives

\discA = e^{\statemat \stepsize}, \qquad \discB = \statemat^{-1}\!\left(e^{\statemat \stepsize} - I\right)\inputmat, \qquad \statevec_{k+1} = \discA\, \statevec_k + \discB\, u_k,

computed in practice through the augmented matrix-exponential trick of Chapter 4 (the HiPPO $\statemat$ is invertible — that promise from §4.3 is kept here). Because every eigenvalue of $\statemat$ has negative real part (§7.7), Chapter 4’s ZOH stability proposition applies verbatim: every discrete eigenvalue satisfies $\abs{\mu_i} = e^{\operatorname{Re}(\lambda_i)\stepsize} < 1$ , so the recurrence is stable for any step size. ZOH is this book’s default discretization throughout Chapters 7–9 precisely because it preserves the HiPPO spectrum so cleanly (the original S4 paper used the bilinear transform instead, and S4D supports either; Chapter 4 §4.6 lays out the trade-off).

This recurrence is the recurrent view of an S4-style layer Gu et al. (2022) . Chapter 8 adds the second view — unrolling the recurrence into a convolution and computing it with an FFT — and shows the two views are identical for zero initial state. The HiPPO matrix is exactly the initialization that S4 places on $\statemat$ before training; S4D Gu et al. (2022) then diagonalizes it for speed, a move §7.5 explains is delicate precisely because of conditioning. Everything S4 does is downstream of the operator in §7.3.

7.5 Conditioning of the HiPPO basis

The HiPPO-LegS matrix is lower-triangular and its eigenvalues are the well-separated integers $-1, \dots, -N$ (§7.7), which sounds maximally benign. It is not, and the reason is non-normality: $\statemat$ is far from symmetric, so its eigenvectors are highly oblique and the matrix that diagonalizes it is ill-conditioned. The condition number of that eigenvector basis grows exponentially in $N$ Yu et al. (2023) , so a naive diagonalization of HiPPO-LegS is numerically untrustworthy at the $N$ used in practice — and the sub-quadratic conditioning one wants is achieved only by deliberately perturbing the transform (the “robustifying” construction below).

Two consequences follow. First, diagonalization is dangerous: S4D’s diagonal reparameterization Gu et al. (2022) must be done in a numerically careful (often complex, approximately-diagonal) form, and the “robustifying” line of work Yu et al. (2023) exists exactly to tame the conditioning of the diagonalizing transform. Second, the choice of discretization interacts with conditioning: a poorly conditioned basis amplifies the per-step error of a low-order scheme, so the second-order schemes of Chapter 4 (and the geometric integrators of Chapter 6) are not academic refinements but a response to a real failure mode. This conditioning question is the seed of research niche C4 — characterizing where HiPPO-style projections become numerically untrustworthy and what to do about it.

7.6 Alternative bases

Legendre is one choice among many; HiPPO is a framework parameterized by the basis–measure pair Gu et al. (2023) . Swapping the basis changes what the state is good at remembering:

LagT (Laguerre): an exponentially-weighted infinite horizon — the measure decays into the past, so the state emphasizes recent history with a graceful tail.
FouT (Fourier): a Fourier basis on a sliding window, well matched to periodic or oscillatory inputs; closely related to the implicit long convolutions of Chapter 11.
Wavelet bases (WaLRUS): replacing polynomials with a redundant wavelet frame Babaei et al. (2025) shifts the optimal-projection target from polynomial smoothness to wavelet sparsity, which can suit signals with localized or multi-scale structure better than a global polynomial basis.

The “How to Train Your HiPPO” generalization Gu et al. (2023) derives the projection ODE for arbitrary measure–basis pairs and identifies which ones are well-conditioned to train — directly connecting basis choice back to §7.5. The lesson is that the structure of $\statemat$ (lower-triangular, stable spectrum, projection-update semantics) survives the basis change; the closed-form entries do not. When a later chapter reaches for a non-Legendre SSM, this is the machinery underneath.

7.7 Eigenvalue structure and stability

Return to the matrix itself. Because $\statemat$ is lower-triangular, its eigenvalues are exactly its diagonal entries.

Proposition 7.3.

(HiPPO-LegS spectrum.) The eigenvalues of the HiPPO-LegS matrix $\statemat$ are exactly $\{-1, -2, \dots, -N\}$ — real, distinct, and strictly negative. Consequently the continuous system $\ddt \statevec = \statemat\statevec + \inputmat u$ is asymptotically stable (Chapter 2), and its ZOH discretization has all discrete eigenvalues strictly inside the unit disk for every $\stepsize > 0$ (Chapter 4).

The proof is immediate from lower-triangularity (Exercise 7.6, §7.10). The companion verifies it numerically: for $N$ up to $32$ the computed eigenvalues match $-1, \dots, -N$ to better than $10^{-8}$ , and the strict upper triangle is exactly zero.

The 16 eigenvalues of the HiPPO-LegS matrix lie on the negative real axis at the integers minus 1 through minus 16, inside the shaded left half-plane. — Spectrum of the HiPPO-LegS $\statemat$ for $N=16$: the eigenvalues are the negative integers $-1, \dots, -16$, all real and in the open left half-plane (shaded). The integer spectrum is what gives HiPPO its graded multi-timescale memory — mode $n$ decays at rate $n+1$. Produced by companions/ch07/jax/hippo_matrix.py.

The integer spectrum has a memory interpretation: under ZOH the $n$ -th mode decays per step by $e^{-(n+1)\stepsize}$ , so low-index coefficients (coarse features of the history) are long-lived while high-index coefficients (fine detail) fade fastest. The state is a graded, multi-timescale memory — and the learnable $\stepsize$ slides the whole schedule. This is the dynamical-systems reading of why HiPPO works, and it is the structure a stability or Lyapunov analysis of a trained S4 layer (Chapter 15) measures against.

7.8 What’s next

This chapter answered “what is $\statemat$ ?” with “the projection-update operator of an optimal polynomial memory,” and showed that operator is lower-triangular, stably-spectrumed, and ZOH-discretizable into a recurrence. Chapter 8 takes that recurrence and develops the full S4 / S4D / S5 machinery: the convolutional view, the FFT kernel, the diagonalization that §7.5 warned is delicate, and the parallel scan. Chapter 9 then makes the parameters input-dependent — the selective SSMs (Mamba) — and Chapter 10 swaps ZOH for the exponential-trapezoidal scheme of Chapter 4 when the dynamics turn stiff. HiPPO is the hinge: every architecture from here inherits the structured $\statemat$ derived in §7.3.

7.9 Exercises

Six problems mixing computation and theory. Short/numerical (7.1–7.3) have inline collapsible solutions; the theory exercises (7.4–7.6) have full worked solutions in §7.10.

Exercise 7.1 (computation)

Using the three-term recurrence, write down $P_0, P_1, P_2, P_3$ and verify by direct integration that $\int_{-1}^{1} P_1(x) P_3(x)\, dx = 0$ . Then confirm $\int_{-1}^1 P_2(x)^2\, dx = 2/5$ .

Solution

$P_0 = 1,\ P_1 = x,\ P_2 = \tfrac12(3x^2 - 1),\ P_3 = \tfrac12(5x^3 - 3x)$ . The product $P_1 P_3 = \tfrac12(5x^4 - 3x^2)$ is even, and $\int_{-1}^1 (5x^4 - 3x^2)\,dx = 5\cdot\tfrac{2}{5} - 3\cdot\tfrac{2}{3} = 2 - 2 = 0$ , so the integral is $0$ — Legendre orthogonality. For the norm, $\int_{-1}^1 \tfrac14(3x^2-1)^2 dx = \tfrac14\int_{-1}^1 (9x^4 - 6x^2 + 1)\,dx = \tfrac14(9\cdot\tfrac25 - 6\cdot\tfrac23 + 2) = \tfrac14(\tfrac{18}{5} - 4 + 2) = \tfrac14\cdot\tfrac{8}{5} = \tfrac{2}{5}$ , matching $2/(2\cdot2+1)$ .

Exercise 7.2 (computation)

Build the $4\times 4$ HiPPO-LegS matrix $\statemat$ from the closed form of §7.3 by hand. Verify that $\statemat_{31} = -\sqrt{21}$ and that the diagonal is $(-1, -2, -3, -4)$ .

Solution

Entry $(3,1)$ has $i = 3 > j = 1$ , so $\statemat_{31} = -\sqrt{(2\cdot3+1)(2\cdot1+1)} = -\sqrt{7\cdot3} = -\sqrt{21} \approx -4.583$ . The diagonal entries are $\statemat_{ii} = -(i+1)$ for $i = 0,1,2,3$ , i.e. $-1, -2, -3, -4$ . The full matrix is lower-triangular with $\statemat_{10} = -\sqrt{3},\ \statemat_{20} = -\sqrt{5},\ \statemat_{21} = -\sqrt{15},\ \statemat_{30} = -\sqrt{7},\ \statemat_{32} = -\sqrt{35}$ , and zeros above the diagonal. The companion hippo_matrix.py prints exactly $\statemat_{31} = -4.582576$ .

Exercise 7.3 (computation + code)

Run companions/ch07/jax/hippo_reconstruction.py. Confirm that the relative $L^2$ reconstruction error of the two-sinusoid test signal falls sharply from $N=4$ to $N=16$ , then plateaus. Why does it plateau rather than continue to zero?

Solution

The companion prints errors $\approx 0.80\ (N{=}4),\ 0.39\ (N{=}8),\ 0.011\ (N{=}16),\ 0.0072\ (N{=}32),\ 0.0072\ (N{=}64)$ . The sharp drop happens once $N$ is large enough to represent the signal’s two frequencies (≈ 1.5 and 4 cycles on the window); beyond that the error plateaus at the discretization floor of the time-varying LegS recurrence at the chosen sample count ( $L=1000$ ), not at zero. Adding more Legendre modes cannot beat the error already incurred by integrating the projection ODE on a finite grid — exactly the conditioning/discretization interaction of §7.5. The Julia and PyTorch companions reproduce these numbers to machine precision.

Exercise 7.4 (theory) — solution in §7.10

Show that the HiPPO-LegS matrix $\statemat$ is lower-triangular, and that its diagonal entries are $\statemat_{ii} = -(i+1)$ , by appealing to the structure of the Legendre three-term recurrence. (You may take the off-diagonal closed form as given; the point is the structure.)

Exercise 7.5 (theory) — solution in §7.10

Prove the best-approximation theorem (Theorem, §7.2): if $\{g_n\}$ is orthonormal in $L^2(\omega)$ and $c_n = \langle u, g_n\rangle_\omega$ , then $\sum_n c_n g_n$ is the unique minimizer of $\lVert u - v\rVert_\omega$ over $v \in \operatorname{span}\{g_n\}$ .

Exercise 7.6 (theory) — solution in §7.10

Using lower-triangularity, prove that the eigenvalues of $\statemat$ are exactly $\{-1, \dots, -N\}$ . Then show that ZOH discretization sends every eigenvalue strictly inside the unit disk for all $\stepsize > 0$ , and identify the per-step decay rate of the $n$ -th mode.

7.10 Full solutions to theory exercises

Solution to Exercise 7.4

The shifted Legendre polynomials satisfy a three-term recurrence in which $P_{n}$ is expressed through $P_{n-1}$ and $P_{n-2}$ — never through higher-degree polynomials. When the projection coefficient $c_n(t) = \langle u, \tilde P_n\rangle_{\omega_t}$ is differentiated in $t$ , the time-derivative of the (scaled) measure couples $c_n$ to the inner products of $u$ against $\tilde P_m$ — and the recurrence guarantees only $m \le n$ appear. Hence $\ddt c_n$ depends on $c_0, \dots, c_n$ but not on $c_{n+1}, \dots$ , which is exactly the statement that $\statemat$ has zero entries above the diagonal: $\statemat_{ij} = 0$ for $i < j$ . For the diagonal, the “self-coupling” term of the $n$ -th coefficient under the scaled measure contributes $-(n+1)$ (the normalization $\sqrt{2n+1}$ and the measure’s $1/t$ scaling combine to the integer $n+1$ ); this is the $\statemat_{ii} = -(i+1)$ claim. The off-diagonal entries $-\sqrt{(2i+1)(2j+1)}$ for $i>j$ follow from the same expansion carried through the normalization, given here without the full bookkeeping. $\blacksquare$

Solution to Exercise 7.5

Let $p = \sum_n c_n g_n$ with $c_n = \langle u, g_n\rangle_\omega$ , and let $v = \sum_n a_n g_n$ be any element of $V = \operatorname{span}\{g_n\}$ . First, the residual $u - p$ is orthogonal to every $g_m$ :

\langle u - p, g_m\rangle_\omega = \langle u, g_m\rangle_\omega - \sum_n c_n \langle g_n, g_m\rangle_\omega = c_m - c_m = 0,

using orthonormality $\langle g_n, g_m\rangle = \delta_{nm}$ . Therefore $u - p \perp V$ , and in particular $u - p \perp (p - v)$ since $p - v \in V$ . Now decompose and apply the Pythagorean theorem:

\lVert u - v\rVert_\omega^2 = \lVert (u - p) + (p - v)\rVert_\omega^2 = \lVert u - p\rVert_\omega^2 + \lVert p - v\rVert_\omega^2 \ge \lVert u - p\rVert_\omega^2,

with equality iff $\lVert p - v\rVert_\omega = 0$ , i.e. $v = p$ . So $p$ is the unique minimizer. $\blacksquare$

Solution to Exercise 7.6

A lower-triangular matrix has its eigenvalues on the diagonal: $\det(\statemat - \lambda I)$ is the product of diagonal entries $\prod_i (\statemat_{ii} - \lambda)$ because the determinant of a triangular matrix is the product of its diagonal, and subtracting $\lambda I$ keeps it triangular. With $\statemat_{ii} = -(i+1)$ the roots are $\lambda_i = -(i+1)$ for $i = 0,\dots,N-1$ , i.e. exactly $\{-1, -2, \dots, -N\}$ — real, distinct, strictly negative.

Under ZOH the discrete dynamics matrix is $\discA = e^{\statemat\stepsize}$ , whose eigenvalues are $\mu_i = e^{\lambda_i \stepsize}$ . Then

\abs{\mu_i} = \abs{e^{\lambda_i \stepsize}} = e^{\operatorname{Re}(\lambda_i)\stepsize} = e^{-(i+1)\stepsize} < 1 \quad\text{for all } \stepsize > 0,

so every discrete eigenvalue lies strictly inside the unit disk (A-stability, Chapter 4). The per-step decay rate of mode $n$ is $e^{-(n+1)\stepsize}$ : higher-index modes decay faster, giving the graded multi-timescale memory of §7.7. $\blacksquare$

7.11 Companion code

Three runnable companions for Chapter 7 — the first SSM-core chapter, where the foundations and numerics threads meet in one operator. All three language tracks produce the same HiPPO-LegS matrix and the same reconstruction errors, which is itself the lesson: the mathematics is framework-independent; the idioms differ.

JAX (companions/ch07/jax/):

hippo_matrix.py — builds the HiPPO-LegS $\statemat, \inputmat$ from the §7.3 closed form with a vectorized jnp.where (no Python loop, jit-fusable); emits hippo_matrix_structure.png and hippo_eigenvalues.png.
hippo_reconstruction.py — runs the time-varying LegS recurrence via jax.lax.scan and reconstructs the history; emits legendre_basis.png and hippo_reconstruction.png.
tests/test_hippo.py — pins the closed form, the integer spectrum, the reconstruction-error decay, and a lax.scan-vs-naive-loop equivalence guard.

Julia (companions/ch07/julia/):

hippo_legendre.jl — the same operator with stdlib LinearAlgebra only; the encoder is an explicit for loop (the contrast to lax.scan), and the Legendre basis is built from the three-term recurrence.
runtests.jl — @testset assertions including an orthonormality Gram-matrix check.

PyTorch (companions/ch07/torch/):

hippo_operator.py — HiPPO-LegS wrapped as an nn.Module projection operator (fixed buffers, zero learnable parameters); a define-by-run eager loop encodes the signal. This is the object Chapter 8’s S4 layer extends.
tests/test_hippo_torch.py — cross-framework consistency: the matrix matches the closed form and the reconstruction errors reproduce the JAX/Julia numbers.

To run from the repo root:

# JAX (uses the uv .venv with jax, numpy, scipy, matplotlib)
PYTHONPATH=. python companions/ch07/jax/hippo_matrix.py
PYTHONPATH=. python companions/ch07/jax/hippo_reconstruction.py

# Julia (stdlib only; instantiates instantly)
julia --project=companions/ch07/julia companions/ch07/julia/hippo_legendre.jl

# PyTorch (needs the .venv [torch] extra)
PYTHONPATH=. python companions/ch07/torch/hippo_operator.py

All figures emit to public/figures/ch07/.

Part II · SSM Core Week 8 Published

LTI SSMs: S4, S4D, S5 as discretized continuous systems

The S4 architecture as a discretized HiPPO LTI system — the convolution–recurrence duality and its FFT kernel, the diagonal-plus-low-rank structure behind S4's fast kernel, S4D's diagonal restriction with by-construction stability, and S5's parallel associative scan.

LTI SSMs: S4, S4D, S5 as discretized continuous systems

Chapter 8 — at a glance

Goal: take the HiPPO recurrence Chapter 7 ended on and build the S4 architecture Gu et al. (2022) on top of it. An S4 layer is a discretization (Chapter 4) of the HiPPO LTI system (Chapter 7) with a learnable timescale — the original paper used the bilinear transform; this chapter, like S5 and the Mamba line, uses the zero-order hold. We prove the convolution–recurrence duality that lets the same layer train as an FFT convolution and run as a scan, sketch the diagonal-plus-low-rank structure behind S4’s fast kernel, then specialize twice: S4D drops the low-rank term for a diagonal model whose stability is enforced by construction, and S5 keeps the diagonal but computes it with a parallel associative scan.

Reading time: ~50 minutes prose; 90+ minutes with the JAX and PyTorch companions.

Looking ahead (C1 pilot): S4, S4D, and S5 are all linear time-invariant — one fixed $(\statemat, \inputmat, \outputmat)$ . The symplectic-integrator pilot’s empirical anchor is the selective, complex-state successor, Mamba-3 (Chapter 10); this chapter is the structural prerequisite that pilot reasons backward from, not itself a direct-transfer target.

Key insight: an LTI SSM has three computational faces of one operator — a recurrent scan ( $O(L)$ sequential, for inference), a convolution ( $O(L\log L)$ via FFT, for training), and a parallel associative scan ( $O(\log L)$ depth). S4 exploits the first two, S5 the third. The single thing that breaks this freedom — making the parameters depend on the input — is exactly what Chapter 9 does.

8.1 State-space to discrete: the S4 setup

Chapter 7 closed on a recurrence and called it the recurrent view of an S4-style layer. Make the layer explicit. S4 fixes a single-input, single-output (SISO) continuous system

\ddt \statevec(t) = \statemat\, \statevec(t) + \inputmat\, u(t), \qquad y(t) = \outputmat\, \statevec(t) + D\, u(t),

with $\statemat \in \R^{N\times N}$ the HiPPO-LegS matrix of §7.3, $\inputmat \in \R^{N\times 1}$ , $\outputmat \in \R^{1\times N}$ , and $D$ a scalar feedthrough. To run it on a sampled sequence $u_0, u_1, \dots$ we discretize at step $\stepsize$ . The S4 paper itself uses Chapter 4’s bilinear transform (§4.4), $\discA = (I - \tfrac{\stepsize}{2}\statemat)^{-1}(I + \tfrac{\stepsize}{2}\statemat)$ Gu et al. (2022) ; this book standardizes on the zero-order hold (§4.3) — the rule S5 Smith et al. (2023) and the Mamba line Gu & Dao (2024) later adopted, and one whose difference from bilinear the S4D ablations measure as negligible Gu et al. (2022) :

\discA = e^{\statemat \stepsize}, \qquad \discB = \statemat^{-1}\!\left(e^{\statemat\stepsize} - I\right)\inputmat, \qquad \statevec_{k+1} = \discA\, \statevec_k + \discB\, u_k, \quad y_k = \outputmat\, \statevec_{k+1} + D\, u_k,

computed through the augmented matrix-exponential trick of §4.3 (the companion’s discretize_zoh stacks $\big[\begin{smallmatrix}\statemat\stepsize & \inputmat\stepsize\\ 0 & 0\end{smallmatrix}\big]$ and exponentiates, sidestepping $\statemat^{-1}$ ).

What makes this an architecture rather than a fixed filter is which parts train. S4 learns $\outputmat$ , $D$ , and $\log\stepsize$ (the timescale, kept positive by the log), while $\statemat, \inputmat$ are initialized from HiPPO and often frozen or updated slowly. Stability is inherited for free: by the §7.7 spectrum every eigenvalue of $\statemat$ has negative real part, so Chapter 4’s ZOH proposition gives $\abs{\lambda(\discA)} = e^{\Re(\lambda(\statemat))\stepsize} < 1$ for every $\stepsize > 0$ . The slowest mode (eigenvalue $-1$ ) decays per step by $e^{-\stepsize}$ — for the companion’s default $\stepsize = 0.1$ that is $0.905$ , a long but strictly contracting memory.

This is one way to compute the layer. The rest of the chapter is, in large part, other ways to compute the same thing — and the surprising fact that they are equal.

8.2 The structured state matrix, revisited

Before the second view, recall why the HiPPO $\statemat$ is special enough to deserve fast algorithms. It is dense and lower-triangular (§7.3), so a black-box $e^{\statemat\stepsize}$ and a black-box kernel would cost more than the structure warrants. The structure S4 exploits is that the HiPPO-LegS matrix is normal-plus-low-rank (NPLR): up to an orthogonal change of basis it is a normal matrix plus a rank-one correction, and after diagonalizing the normal part it becomes diagonal-plus-low-rank (DPLR),

\statemat = \Lambda - P\,Q^{*}, \qquad \Lambda = \diag(\lambda_1, \dots, \lambda_N)\in\C^{N\times N}, \quad P, Q \in \C^{N\times 1},

with $\Lambda$ the (complex) diagonal of eigenvalues and $P Q^*$ a rank-one term Gu et al. (2022) . That the HiPPO-LegS matrix admits this diagonal-plus-rank-one form is not evident from the §7.3 closed form — it is a construction of the S4 paper, which we take as given rather than re-derive here. This is the same non-normality that §7.5 flagged as a conditioning hazard, now read constructively: the eigenvector basis is oblique (hence ill-conditioned), but the correction that makes $\statemat$ non-normal is only rank one, and a rank-one correction is exactly what the Woodbury identity handles cheaply.

The DPLR form is the hinge for the rest of the chapter. Section 8.4 uses the full $\Lambda - PQ^*$ to compute S4’s kernel quickly; §8.5 throws away the $PQ^*$ term entirely and keeps only $\Lambda$ , which is S4D.

8.3 The convolution–recurrence duality

The recurrence of §8.1 is sequential: $\statevec_{k+1}$ needs $\statevec_k$ . Unrolling it from zero initial state exposes a second, fully parallel description. With $\statevec_0 = 0$ ,

\statevec_{k+1} = \sum_{j=0}^{k} \discA^{\,k-j}\, \discB\, u_j \quad\Longrightarrow\quad y_k = \sum_{j=0}^{k} \big(\outputmat\, \discA^{\,k-j}\, \discB\big)\, u_j + D\, u_k.

Define the SSM convolution kernel $K \in \R^{L}$ by $K_i = \outputmat\, \discA^{\,i}\, \discB$ . Then the output is a causal convolution plus the feedthrough:

y = K * u + D\, u.

Proposition 8.1.

(Convolution–recurrence duality.) For zero initial state, the SISO recurrence $\statevec_{k+1} = \discA\statevec_k + \discB u_k$ , $\,y_k = \outputmat\statevec_{k+1} + D u_k$ produces exactly the convolution $y = K * u + D u$ with kernel $K_i = \outputmat\, \discA^{\,i}\, \discB$ , $\,i = 0, \dots, L-1$ . The recurrent form costs $O(L)$ sequential steps; the convolution, computed by FFT, costs $O(L\log L)$ parallel work.

The proof (Exercise 8.4, §8.9) is the unrolling above made rigorous by induction. The two forms are not approximations of each other — they are the same numbers. The companion confirms it to machine precision: the recurrent lax.scan output and the FFT convolution agree to a measured residual $\sim 10^{-15}$ in float64, which test_conv_recurrence_duality pins below $10^{-12}$ .

Left: the SSM convolution kernel as a smooth decaying sequence of taps. Right: the recurrent-scan output and the FFT-convolution output plotted on top of each other, indistinguishable, with the maximum residual annotated below 1e-15. — The convolution–recurrence duality for an $N=16$ HiPPO-S4 system. Left: the kernel $K_i = \outputmat\,\discA^{\,i}\,\discB$, a decaying tap sequence. Right: the recurrent-scan output (solid) and the FFT-convolution output (dashed) coincide to machine precision (measured residual $\sim 10^{-15}$, pinned below $10^{-12}$). Produced by companions/ch08/jax/s4_duality.py; pinned by test_conv_recurrence_duality in companions/ch08/jax/tests/test_s4.py.

The duality is why S4 is practical: train as a convolution, deploy as a recurrence. Training sees whole sequences at once, so the parallel $O(L\log L)$ FFT wins; autoregressive inference produces one token at a time, so the $O(L)$ scan with its $O(N)$ state wins. One set of parameters, two runtimes.

8.4 The Cauchy kernel: structure buys speed

The duality leaves one cost open: building the kernel $K$ . Naively, $K_i = \outputmat\,\discA^{\,i}\,\discB$ iterates the state vector $\discA^{\,i}\discB$ — one $N\times N$ mat-vec per tap, $O(N^2 L)$ in total. For the teaching regime $N \le 64$ that is exactly what the companion does, and it is plenty fast. S4’s published contribution is to compute the same kernel in near-linear time, $\tilde O(N + L)$ , by using the DPLR structure of §8.2 Gu et al. (2022) .

The route is a generating function. Collect the taps into a polynomial and evaluate it on the unit circle: the truncated generating function is

\hat K(z) = \sum_{i=0}^{L-1} \outputmat\,\discA^{\,i}\,\discB\; z^{i} \;\approx\; \outputmat\,(I - \discA\, z)^{-1}\,\discB,

a geometric series in the matrix $\discA z$ . Recovering $K$ from $\hat K$ evaluated at the $L$ -th roots of unity is one inverse FFT. The remaining cost is the resolvent $(I - \discA z)^{-1}$ , and here DPLR pays off: with $\discA$ built from $\Lambda - PQ^*$ the Woodbury identity turns the resolvent into a Cauchy matrix–vector product — entries $1/(\zeta_k - \lambda_n)$ — plus a rank-one correction. The whole kernel is then a Cauchy/Woodbury solve in $N$ plus the inverse FFT in $L$ (the §8.4 figure isolates the $N$ -scaling at fixed $L$ ); the dense $\discA^{\,i}$ never appears.

We present this kernel but do not implement it: the companions use the transparent naive kernel throughout (§8.10). The point of §8.4 is the shape of the cost, not the engineering — and the shape is what motivates S4D.

Log-log plot of kernel computation time versus state dimension N. The measured naive-kernel times track a slope-2 reference line; a much shallower N-log-squared reference line sits below it. — Why S4 needs structure. Measured wall-clock of the naive $O(N^2 L)$ kernel ($L=512$) tracks the $N^2$ reference; an $O(N\log^2 N)$ reference slope (analytic, not a timed implementation — the Cauchy kernel of §8.4 is not built here) sits asymptotically far below. For $N \le 64$ the naive kernel is more than adequate and far clearer. Produced by companions/ch08/jax/s4_kernel_cost.py.

8.5 S4D: the diagonal restriction

The Cauchy machinery is the price of keeping the rank-one term $PQ^*$ . S4D asks what happens if we simply drop it Gu et al. (2022) — keep the diagonal $\Lambda$ , discard the low-rank correction. The state matrix is now genuinely diagonal, $\statemat = \Lambda = \diag(\lambda_1, \dots, \lambda_N)$ , and everything decouples mode by mode. The kernel collapses from a Cauchy computation to a Vandermonde sum:

K_l = \sum_{n} \outputmat_n\, \bar B_n\, \bar\Lambda_n^{\,l}, \qquad \bar\Lambda_n = e^{\lambda_n \stepsize}, \quad \bar B_n = \frac{\bar\Lambda_n - 1}{\lambda_n},

with $\bar B_n$ the per-mode diagonal ZOH input (this book’s convention; the S4D paper itself treats the integration rule as interchangeable — its base version keeps S4’s bilinear — and measures no noticeable difference between the two). In practice the $N$ real states are carried as $N/2$ complex conjugate pairs and the real kernel is reconstructed as $K_l = 2\,\Re\!\big(\sum_n \outputmat_n \bar B_n \bar\Lambda_n^{\,l}\big)$ .

Proposition 8.2.

(S4D Vandermonde kernel and stability.) For diagonal $\statemat = \diag(\lambda_n)$ the kernel is the Vandermonde sum $K_l = \sum_n \outputmat_n \bar B_n \bar\Lambda_n^{\,l}$ with $\bar\Lambda_n = e^{\lambda_n\stepsize}$ . The discrete system is stable — every $\abs{\bar\Lambda_n} < 1$ — if and only if $\Re(\lambda_n) < 0$ for all $n$ .

The derivation and the stability claim are Exercise 8.5 (§8.9). What makes S4D more than “S4 with worse expressivity” is how it secures the stability condition. The S4D-Lin initialization places the modes at $\lambda_n = -\tfrac12 + i\pi n$ , and — crucially — stores them in the parameterization

\lambda_n = -e^{(\,\log A_{\text{real}})_n} + i\,(A_{\text{imag}})_n,

so that $\Re(\lambda_n) = -e^{(\log A_{\text{real}})_n} < 0$ for any real value the parameters take. Stability is not hoped for; it is structural.

This is the constructive answer to the §7.5 conditioning hazard: where S4 must diagonalize an ill-conditioned non-normal matrix and hope the basis behaves, S4D never forms that basis at all — it parameterizes the eigenvalues directly and pins their real parts negative.

[Note] The sign trap, defused. Chapter 7’s companion hit a sign bug because the HiPPO routine returns the stable

\statemat

(eigenvalues

-1,\dots,-N

), and feeding the positive-eigenvalue variant to ZOH gives

\abs{\discA} > 1

and an exploding kernel. S4D makes that bug unrepresentable:

-e^{(\cdot)}

is negative for every input, so no setting of the weights can push a mode out of the unit disk. The companion’s test_s4d_real_part_negative_for_any_parameters checks exactly this across random extreme parameters.

Diagonal state space models are not a downgrade in practice: with the right initialization they match full S4 on long-range benchmarks while being markedly simpler to implement Gupta et al. (2022) . The companion plots the S4D-Lin spectrum against the full HiPPO spectrum to make the restriction visible.

Complex plane scatter. Gray points: the HiPPO-LegS eigenvalues spread along the negative real axis from minus 1 to minus 32. Navy points: the S4D-Lin modes at real part minus one-half, stepping up the imaginary axis at multiples of pi. — S4D-Lin diagonal initialization ($M=16$ complex modes at $-\tfrac12 + i\pi n$, navy) against the full HiPPO-LegS spectrum ($N=32$ real eigenvalues $-1,\dots,-N$, gray). S4D abandons HiPPO's real integer spectrum for a fixed-real-part comb up the imaginary axis; the common feature is $\Re(\lambda) < 0$, here enforced by construction. Produced by companions/ch08/jax/s4d_kernel.py.

8.6 S5: the parallel associative scan

S4D made the state diagonal but still computed it as a convolution. S5 keeps the diagonal state and abandons the convolution for a parallel scan Smith et al. (2023) . Two changes. First, S5 is MIMO: one shared diagonal state of size $P$ drives $H$ input/output channels through $\inputmat \in \C^{P\times H}$ and $\outputmat \in \C^{H\times P}$ , instead of $H$ independent SISO systems. Second, it computes the recurrence

\statevec_k = \discA \odot \statevec_{k-1} + \discB\, u_k

(diagonal $\discA$ , so $\odot$ is elementwise) with an associative scan. The trick is that a linear recurrence is an associative operation on $(\text{transition}, \text{input})$ pairs:

(a_1, b_1) \bullet (a_2, b_2) = (a_2 \odot a_1,\; a_2 \odot b_1 + b_2),

which composes “apply the first affine update, then the second.” Scanning $\bullet$ over the pairs $(\discA, \discB u_k)$ yields every state $\statevec_k$ at once.

Proposition 8.3.

(Scan equivalence.) The operator $\bullet$ above is associative, and the inclusive prefix scan of $(\discA, \discB u_0), \dots, (\discA, \discB u_{L-1})$ under $\bullet$ has as its second components exactly the sequential states $\statevec_0, \dots, \statevec_{L-1}$ . A balanced reduction computes all prefixes in parallel depth $O(\log L)$ Blelloch (1990) , versus the $O(L)$ depth of the sequential recurrence.

The proof is Exercise 8.6 (§8.9). The companion pins the equivalence to machine precision — the associative scan and the sequential scan agree to about $3\times10^{-15}$ (pinned below $10^{-12}$ ) — and contrasts the two on the algorithmic cost that actually differs: critical-path depth.

Two panels. Left: log-log plot of parallel steps versus sequence length; the sequential curve rises linearly as L while the parallel curve rises as log L. Right: semilog plot of the maximum state difference between the two methods versus L, flat near 1e-15. — S5's parallel scan vs the sequential recurrence. Left: critical-path depth — sequential $O(L)$ against parallel $\lceil\log_2 L\rceil$. Right: the two compute identical states, agreeing to $\sim 10^{-15}$ (pinned below $10^{-12}$) across sequence lengths. The depth advantage is realized on parallel hardware (GPU/TPU). Produced by companions/ch08/jax/s5_scan.py; pinned by test_s5_parallel_equals_sequential_states.

[Note] No free scan in PyTorch. JAX has lax.associative_scan; PyTorch has no native parallel-scan primitive, so the torch companion (s5_sequential.py) runs the recurrence as an eager loop — correct, and bit-for-bit identical to the JAX states, but sequential. The

O(\log L)

depth is a JAX/parallel-hardware story; in production a custom CUDA or Triton kernel supplies the scan.

S5 deliberately stops at one diagonal SSM run in parallel. Its MIMO channel mixing and, above all, the move to input-dependent transitions are Chapter 9’s subject — and that move is what makes the associative scan indispensable rather than merely convenient.

8.7 What’s next

The thread of this chapter is that S4, S4D, and S5 are three computational treatments of one linear time-invariant system: a convolution and a recurrence that the duality proves equal (§8.3), a diagonal restriction whose stability is free (§8.5), and a parallel scan that reorders the same arithmetic into $O(\log L)$ depth (§8.6). Time-invariance is what every one of these relies on — a fixed $\discA$ is what lets a single kernel exist and a single transition seed the scan.

Chapter 9 breaks time-invariance on purpose: Mamba makes $\statemat, \inputmat, \outputmat$ functions of the input Gu & Dao (2024) (and the structured state-space duality Dao & Gu (2024) later reframes the result), which destroys the convolutional view and leaves the selective scan — the §8.6 primitive, now load-bearing — as the only way to compute the layer. Chapter 10 then replaces ZOH with the exponential-trapezoidal scheme of Chapter 4 when the selective dynamics turn stiff, in Mamba-3 Lahoti et al. (2026) ; that chapter is where the symplectic-integrator pilot finds its empirical anchor. Everything downstream inherits the discretized-LTI scaffold built here.

8.8 Exercises

Six problems mixing computation and theory. Short/numerical (8.1–8.3) have inline collapsible solutions; the theory exercises (8.4–8.6) have full worked solutions in §8.9.

Exercise 8.1 (computation)

Take a diagonal toy system $\statemat = \diag(-1, -2)$ , $\inputmat = (1, 1)^\top$ , $\outputmat = (1, 1)$ , $D = 0$ , and step $\stepsize = 0.1$ . Compute $\discA$ and $\discB$ by the diagonal ZOH, then the first four kernel taps $K_0, \dots, K_3$ . Confirm the taps decay.

Solution

Diagonal ZOH acts mode-by-mode: $\bar A_n = e^{\lambda_n \stepsize}$ and $\bar B_n = (\bar A_n - 1)/\lambda_n \cdot B_n$ . So $\bar A = (e^{-0.1}, e^{-0.2}) = (0.9048, 0.8187)$ , and $\bar B_1 = (1 - e^{-0.1})/1 = 0.0952$ , $\bar B_2 = (1 - e^{-0.2})/2 = 0.0906$ . With $\outputmat = (1,1)$ the taps are $K_k = \sum_n \bar A_n^{\,k}\, \bar B_n = 0.0952\,(0.9048)^k + 0.0906\,(0.8187)^k$ :

K_0 = 0.1858,\quad K_1 = 0.1603,\quad K_2 = 0.1387,\quad K_3 = 0.1202.

Each tap is a sum of two decaying geometric sequences (rates $e^{-0.1}$ and $e^{-0.2}$ ), so the kernel decays monotonically — a discrete sum of exponentials, one per mode.

Exercise 8.2 (computation + code)

The S4D kernel sums one geometric sequence per mode. (a) Show that $\abs{\bar\Lambda_n^{\,l}} = e^{\Re(\lambda_n)\stepsize\, l}$ , so each mode decays at its own rate. (b) Show that a complex mode $\lambda$ with weight $w = \outputmat\bar B$ and its conjugate $\bar\lambda$ with weight $\bar w$ together contribute $w\,\bar\Lambda^{\,l} + \bar w\,\overline{\bar\Lambda}^{\,l} = 2\Re(w\,\bar\Lambda^{\,l})$ , explaining the real $2\Re$ form. (c) Run companions/ch08/jax/s4d_kernel.py and confirm the Vandermonde kernel matches the recurrence oracle.

Solution

(a) $\bar\Lambda_n = e^{\lambda_n\stepsize}$ , so $\bar\Lambda_n^{\,l} = e^{\lambda_n\stepsize l}$ and $\abs{\bar\Lambda_n^{\,l}} = e^{\Re(\lambda_n)\stepsize l}$ ; with $\Re(\lambda_n) < 0$ this decays geometrically at rate $e^{\Re(\lambda_n)\stepsize}$ per tap. (b) For any complex $\zeta$ , $w\zeta + \bar w\bar\zeta = 2\Re(w\zeta)$ since a number plus its conjugate is twice its real part; applying it with $\zeta = \bar\Lambda^{\,l}$ (and noting $\overline{\bar\Lambda}^{\,l} = \overline{\bar\Lambda^{\,l}}$ ) gives the claim. Carrying $N/2$ conjugate pairs and taking $2\Re$ thus reproduces the $N$ -real-state kernel while storing only half as many complex numbers. (c) test_s4d_kernel_matches_recurrence_oracle pins the closed-form Vandermonde kernel against an independent complex-diagonal recurrence impulse response to $10^{-9}$ .

Exercise 8.3 (code)

Run companions/ch08/jax/s4_duality.py and confirm the recurrent and convolutional outputs agree (the companion reports a residual below $10^{-15}$ ). Then explain why the agreement would break if the FFT used length $L$ instead of $2L$ .

Solution

The FFT computes a circular convolution of length equal to the transform size. With a length- $L$ transform, output position $0$ receives a contribution from kernel tap $L-1$ wrapped around the end of the sequence — a spurious term that the causal linear convolution does not contain, corrupting the early outputs. Zero-padding both signals to $2L$ before the FFT makes the circular convolution agree with the causal linear one on the first $L$ positions (the wrapped terms land in the padded tail, which is discarded). The companion’s causal_conv_fft uses n = 2*L for exactly this reason; test_fft_pad_matches_direct_convolution pins the padded FFT against a direct convolution.

Exercise 8.4 (theory) — solution in §8.9

Prove the convolution–recurrence duality (Proposition, §8.3): for zero initial state, the SISO recurrence equals $y = K * u + D u$ with $K_i = \outputmat\,\discA^{\,i}\,\discB$ .

Exercise 8.5 (theory) — solution in §8.9

Derive the S4D Vandermonde kernel from the diagonal recurrence, prove the discrete system is stable iff $\Re(\lambda_n) < 0$ for all $n$ , and show that the S4D-Lin parameterization $\lambda_n = -e^{(\log A_{\text{real}})_n} + i(A_{\text{imag}})_n$ enforces stability for every parameter value.

Exercise 8.6 (theory) — solution in §8.9

Prove that the scan operator $(a_1,b_1)\bullet(a_2,b_2) = (a_2 a_1,\, a_2 b_1 + b_2)$ is associative, that its inclusive prefix scan reproduces the sequential states $\statevec_k$ , and that a balanced reduction achieves parallel depth $O(\log L)$ .

8.9 Full solutions to theory exercises

Solution to Exercise 8.4

Start from $\statevec_0 = 0$ and $\statevec_{k+1} = \discA\statevec_k + \discB u_k$ . Claim: $\statevec_{k+1} = \sum_{j=0}^{k} \discA^{\,k-j}\discB\, u_j$ . Induct on $k$ . Base ( $k=0$ ): $\statevec_1 = \discA\cdot 0 + \discB u_0 = \discA^{\,0}\discB\, u_0$ . Step: assuming $\statevec_k = \sum_{j=0}^{k-1}\discA^{\,k-1-j}\discB u_j$ ,

\statevec_{k+1} = \discA\statevec_k + \discB u_k = \sum_{j=0}^{k-1}\discA^{\,k-j}\discB u_j + \discA^{\,0}\discB u_k = \sum_{j=0}^{k}\discA^{\,k-j}\discB u_j.

Now the output, using $y_k = \outputmat\statevec_{k+1} + D u_k$ :

y_k = \sum_{j=0}^{k}\outputmat\discA^{\,k-j}\discB\, u_j + D u_k = \sum_{j=0}^{k} K_{k-j}\, u_j + D u_k,

with $K_i = \outputmat\discA^{\,i}\discB$ . The sum $\sum_{j=0}^k K_{k-j} u_j$ is precisely the causal convolution $(K * u)_k$ , so $y = K * u + D u$ . The lower-triangular Toeplitz structure of the map $u \mapsto y - Du$ is the convolution. $\blacksquare$

Solution to Exercise 8.5

For diagonal $\statemat = \diag(\lambda_n)$ the matrix exponential is diagonal, $\discA = \diag(e^{\lambda_n\stepsize})$ , and the ZOH input $\discB = \statemat^{-1}(e^{\statemat\stepsize} - I)\inputmat$ acts mode-by-mode: $\bar B_n = (e^{\lambda_n\stepsize} - 1)/\lambda_n \cdot B_n$ . The kernel inherits the diagonality,

K_l = \outputmat\,\discA^{\,l}\,\discB = \sum_n \outputmat_n\, e^{\lambda_n\stepsize l}\, \bar B_n = \sum_n \outputmat_n\,\bar B_n\,\bar\Lambda_n^{\,l},

a sum of geometric sequences — the Vandermonde form (the real $2\Re$ version follows by pairing conjugates, Exercise 8.2). Stability: the kernel and the state stay bounded iff each geometric ratio satisfies $\abs{\bar\Lambda_n} < 1$ . Since $\abs{\bar\Lambda_n} = \abs{e^{\lambda_n\stepsize}} = e^{\Re(\lambda_n)\stepsize}$ and $\stepsize > 0$ , this holds iff $\Re(\lambda_n) < 0$ for every $n$ . Finally, under S4D-Lin the real part is $\Re(\lambda_n) = -e^{(\log A_{\text{real}})_n}$ , and a negative exponential is strictly negative for every real exponent — so $\Re(\lambda_n) < 0$ holds for all parameter values, with no constraint to enforce during training. Stability is built into the parameterization rather than maintained by it. $\blacksquare$

Solution to Exercise 8.6

Associativity. Write the action of a pair $(a, b)$ on a state $h$ as the affine map $h \mapsto a h + b$ . Composition of affine maps is associative because function composition is. Concretely, with $\bullet$ elementwise,

\big((a_1,b_1)\bullet(a_2,b_2)\big)\bullet(a_3,b_3) = (a_2 a_1,\, a_2 b_1 + b_2)\bullet(a_3,b_3) = (a_3 a_2 a_1,\; a_3 a_2 b_1 + a_3 b_2 + b_3),

(a_1,b_1)\bullet\big((a_2,b_2)\bullet(a_3,b_3)\big) = (a_1,b_1)\bullet(a_3 a_2,\, a_3 b_2 + b_3) = (a_3 a_2 a_1,\; a_3 a_2 b_1 + a_3 b_2 + b_3),

and the two right-hand sides are identical. Prefix scan reproduces the states. Let $e_k = (\discA, \discB u_k)$ and $s_k = e_0 \bullet \cdots \bullet e_k$ be the inclusive scan. By induction the second component of $s_k$ equals $\statevec_k$ : for $k = 0$ it is $\discB u_0 = \statevec_0$ (with $\statevec_{-1}=0$ ); and $s_k = s_{k-1} \bullet e_k$ has second component $\discA\cdot(\text{2nd comp of } s_{k-1}) + \discB u_k = \discA\statevec_{k-1} + \discB u_k = \statevec_k$ . Depth. Because $\bullet$ is associative, computing all prefixes is a prefix-sum problem, which a balanced binary reduction tree evaluates in $O(\log L)$ sequential rounds with $O(L)$ total work Blelloch (1990) — versus the $O(L)$ rounds of the naive recurrence. $\blacksquare$

8.10 Companion code

Two language tracks for Chapter 8 — JAX (the reference, with the parallel scan) and PyTorch (the nn.Module operators). Julia is omitted here: the S4D reference and the associative-scan primitive both live in the JAX/PyTorch world, and a stdlib-Julia rebuild would add a third spelling without a third lesson. The JAX and torch companions are pinned bit-for-bit against each other, the cross-framework lesson Chapter 7 introduced.

JAX (companions/ch08/jax/):

s4_core.py — the harness-free S4 numerical core ported from the Week 4 source: make_hippo_legs, discretize_zoh / discretize_bilinear, the naive $O(N^2 L)$ kernel, and the recurrent and convolutional views. The §8.3 duality lives here.
s4_duality.py — emits kernel-duality.png (§8.3): kernel taps plus the two views overlaid.
s4_kernel_cost.py — times the naive kernel across $N$ ; emits kernel-vs-n.png (§8.4).
s4d_kernel.py — the S4D-Lin diagonal init and Vandermonde kernel (§8.5), complex128; emits s4d-spectrum.png.
s5_scan.py — the S5 associative scan via jax.lax.associative_scan, with the sequential reference; emits s5-scan-depth.png (§8.6).
tests/ — test_s4.py pins the duality (recurrent $\equiv$ convolutional), the kernel oracle, ZOH stability, and the FFT-pad guard; test_s4d_s5.py pins S4D by-construction stability, the Vandermonde-vs-recurrence cross-check, and S5 scan-equivalence.

PyTorch (companions/ch08/torch/):

s4d_kernel.py — the S4D kernel as an nn.Module (fixed-init buffers), the Vandermonde computation ported from the reference S4D, in complex128.
s5_sequential.py — the diagonal MIMO S5 as a sequential scan (PyTorch has no native parallel scan); the states match the JAX associative scan bit-for-bit.
tests/test_s4d_torch.py — cross-framework parity: the torch S4D kernel and S5 states equal their JAX counterparts to within $10^{-8}$ .

To run from the repo root:

# JAX (uses the uv .venv with jax, numpy, matplotlib)
PYTHONPATH=. python companions/ch08/jax/s4_duality.py
PYTHONPATH=. python companions/ch08/jax/s4_kernel_cost.py
PYTHONPATH=. python companions/ch08/jax/s4d_kernel.py
PYTHONPATH=. python companions/ch08/jax/s5_scan.py

# PyTorch (needs the .venv [torch] extra)
PYTHONPATH=. python companions/ch08/torch/s4d_kernel.py
PYTHONPATH=. python companions/ch08/torch/s5_sequential.py

All figures emit to public/figures/ch08/.

Part II · SSM Core Week 9 Published

Selective SSMs: Mamba-1, Mamba-2, and the SSD framework

Making the SSM parameters input-dependent — selectivity breaks linear time-invariance, collapses the convolution–recurrence duality, and leaves the associative scan as the only route; the structured state-space duality then re-reads the selective scan as a semiseparable matrix and, for scalar state, as masked linear attention.

Selective SSMs: Mamba-1, Mamba-2, and the SSD framework

Chapter 9 — at a glance

Goal: take the linear time-invariant SSM Chapter 8 built and make its parameters depend on the input. Mamba Gu & Dao (2024) lets $\stepsize$ , $\inputmat$ , and $\outputmat$ be functions of the current token, keeping only the diagonal $\statemat$ fixed. We show this single change is destructive in a precise way: it breaks the time-invariance the §8.3 convolution–recurrence duality relied on, so the convolutional view collapses and the §8.6 associative scan — now fed input-dependent transitions — is the parallel primitive that remains. We then read the selective scan as a matrix: the structured state-space duality (SSD) of Mamba-2 Dao & Gu (2024) identifies the layer with a semiseparable matrix multiply, and for scalar state with masked linear attention.

Reading time: ~55 minutes prose; 90+ minutes with the JAX and PyTorch companions.

Looking ahead (C1 pilot): Chapter 8 noted that the symplectic-integrator pilot’s empirical anchor is the selective, complex-state successor Mamba-3 (Chapter 10). This chapter is the structural step that introduces selectivity in the real-diagonal setting; Chapter 10 adds the complex state and the exponential-trapezoidal discretization (Chapter 4) the pilot scrutinizes. Selectivity is the property Mamba-3 inherits and the pilot stress-tests.

Key insight: the LTI layer of §8.7 had three faces of one operator — a recurrence, a convolution, and a parallel scan. Making the parameters input-dependent destroys the convolutional face — a kernel exists only when the dynamics are time-invariant — leaving the recurrence and its parallel associative scan as how the layer is computed. Re-expressing that scan as a matrix multiply then adds a second parallel route and shows the selective SSM is a form of attention — the bridge to Chapter 11.

9.1 Breaking time-invariance: input-dependent $(\stepsize, \inputmat, \outputmat)$

Chapter 8 ended on a layer with three computational faces and one weakness it never named: an LTI SSM applies the same dynamics to every token. The discrete map $\statevec_{k+1} = \discA\,\statevec_k + \discB\,u_k$ filters all inputs through one fixed kernel, so the layer cannot decide, based on content, what to keep and what to discard. On synthetic tasks that demand exactly that — selective copying, where only some tokens matter, or induction, where the model must latch onto a token seen once — a fixed kernel provably cannot route information by content Gu & Dao (2024) .

Mamba’s remedy is one structural change: make the parameters functions of the input. At step $t$ ,

\stepsize_t = \mathrm{softplus}\!\big(\omega + s_\Delta(u_t)\big), \qquad \inputmat_t = s_B(u_t), \qquad \outputmat_t = s_C(u_t),

with $s_\Delta, s_B, s_C$ linear projections and $\omega$ a bias; only the diagonal state matrix $\statemat = -e^{\,\log \statemat}$ stays a fixed parameter (kept stable by the §8.5 sign-by-construction trick). Discretizing per step — Mamba uses the simplified form $\discA_t = e^{\stepsize_t \statemat}$ , $\discB_t = \stepsize_t \inputmat_t$ — gives a recurrence whose transition $\discA_t$ now changes at every step:

\statevec_t = \discA_t \odot \statevec_{t-1} + \discB_t\, u_t, \qquad y_t = \outputmat_t \cdot \statevec_t + D\, u_t .

[Note] Stability survives selectivity for free. With

\statemat = -e^{(\cdot)} < 0

and

\stepsize_t = \mathrm{softplus}(\cdot) > 0

, every

\discA_t = e^{\stepsize_t \statemat}

satisfies

\abs{\discA_t} \le 1

for any parameter value — strictly

< 1

except in the hold limit

\stepsize_t \to 0

, where the gate closes and

\discA_t \to 1

(the held state of §9.2). This is the §8.5 “sign trap, defused”, carried into the time-varying setting; the companion’s test_selective_stable_for_any_parameters checks

\Re(\statemat) < 0

and

\abs{\discA_t} \le 1

across extreme random parameters.

The cost of this freedom is the subject of the rest of the chapter, and it is exactly the structure Chapter 8 spent its length building. A single fixed $\discA$ is what let one kernel $K_i = \outputmat\,\discA^{\,i}\,\discB$ exist. Once $\discA_t$ varies, that kernel is gone.

Proposition 9.1.

(LTI collapse.) If the discrete transition is input-dependent — $\discA_t \neq \discA_{t'}$ for some $t, t'$ — then the map $u \mapsto y - D u$ is causal and linear, hence lower-triangular, but not Toeplitz: there is no single kernel $K$ with $y = K * u + D u$ . The convolutional view of §8.3 exists if and only if the system is time-invariant.

The proof (Exercise 9.4, §9.9) is short: a causal linear map is a convolution exactly when its matrix has constant diagonals, and input-dependence makes even the main diagonal vary. The duality of §8.3 was a property of time-invariance all along; selectivity spends it.

9.2 Selective SSMs as discretized linear time-varying systems

The dynamical-systems name for what just happened is familiar. Chapter 8’s LTI SSM is the discretization of an autonomous linear system $\dot{\statevec} = \statemat\,\statevec + \inputmat\,u$ , whose coefficients do not depend on $t$ . A selective SSM is the discretization of a linear time-varying (LTV), or non-autonomous, system $\dot{\statevec}(t) = \statemat(t)\,\statevec(t) + \inputmat(t)\,u(t)$ . The whole apparatus of §8.3 — a single $\discA$ , its powers $\discA^{\,i}$ , one kernel — assumed autonomy. LTV systems replace the matrix power with a matrix product.

Definition 9.2.

(Discrete state-transition operator.) For the LTV recurrence $\statevec_t = \discA_t \statevec_{t-1} + \discB_t u_t$ with $\statevec_0 = 0$ , the state is

\statevec_k = \sum_{j=1}^{k} \Phi(k, j)\,\discB_j\, u_j, \qquad \Phi(k, j) = \discA_k \discA_{k-1} \cdots \discA_{j+1} \;\;(\Phi(k,k) = I),

where $\Phi(k, j)$ is the state-transition operator carrying state from step $j$ to step $k$ . The input $\to$ output map therefore has $(k, j)$ entry $\outputmat_k\,\Phi(k, j)\,\discB_j$ — a path-ordered product of transitions, not the power $\discA^{\,k-j}$ of the LTI case.

This is why no kernel exists: the tap from source $j$ to target $k$ depends on the entire path of step sizes between them, $\Phi(k,j)$ , and not on the offset $k - j$ . For a diagonal $\statemat$ the product is elementwise and the per-mode transition is a clean exponential of an accumulated step, $\Phi(k,j)_n = \exp\!\big(\statemat_n \sum_{i=j+1}^{k} \stepsize_i\big)$ — the companion’s segsum computes exactly this cumulative decay. Reading the dynamics through the transition operator is the right altitude: it is what survives into §9.5 as the entries of a structured matrix, and it is what makes the LTV system non-Toeplitz but still cheaply structured.

Selectivity is visible already in a single mode. Drive one stable mode with a content impulse and contrast a fixed- $\stepsize$ LTI system against a selective one whose gate $\discA_t = e^{\stepsize_t \statemat}$ opens ( $\stepsize_t$ large, $\discA_t \to 0$ : overwrite) on the content token and closes ( $\stepsize_t \to 0$ , $\discA_t \to 1$ : hold) elsewhere. The selective system writes once and holds; the LTI system, forced to apply the same decay everywhere, forgets.

Two panels. Left: an input impulse at step 5 and a selective step-size that spikes there and is near zero elsewhere. Right: the selective state holds nearly flat after the impulse while the LTI state decays rapidly toward zero. — Selectivity as a content-dependent gate, single mode $\statemat = -1$. Left: the input impulse and the selective $\stepsize_t$ (large to write, near-zero to hold). Right: with the gate held, the selective state retains 84.4% of the written value 34 steps later; the fixed-$\stepsize$ LTI system, applying one decay everywhere, retains 0.004%. Produced by companions/ch09/jax/selective_vs_lti.py.

9.3 The selective scan: the §8.6 primitive, now the only route

The convolution is gone (Proposition, §9.1), so the $O(L\log L)$ FFT path of training-time S4 has no analogue here. What does survive is the parallel scan of §8.6 — and it survives because it never needed time-invariance in the first place. Recall the associative operator on $(\text{transition}, \text{input})$ pairs,

(a_1, b_1) \bullet (a_2, b_2) = (a_2 \odot a_1,\; a_2 \odot b_1 + b_2),

which composes “apply the first affine update, then the second.” Chapter 8 fed it the same $\discA$ at every step; nothing in the operator required that. Feeding it the time-varying pairs $(\discA_t,\; \discB_t u_t)$ scans the selective recurrence directly. This is the selective scan.

Proposition 9.3.

(Selective scan equivalence.) The operator $\bullet$ is associative for arbitrary first components $a_t$ (not necessarily equal). Its inclusive prefix scan over $(\discA_1, \discB_1 u_1), \dots, (\discA_L, \discB_L u_L)$ has as its second components exactly the LTV states $\statevec_1, \dots, \statevec_L$ , computed in parallel depth $O(\log L)$ Blelloch (1990) . Because no convolution exists (Proposition, §9.1), the $O(L\log L)$ FFT path of training-time S4 is gone; this scan recovers $O(\log L)$ depth at linear $O(L)$ work and is the primitive the layer is computed with. (Section 9.5 derives a second parallel route — a quadratic matrix multiply — from the same structure.)

The proof (Exercise 9.5, §9.9) is the §8.6 argument with one observation added: the associativity computation never used $a_1 = a_2$ , so the LTI assumption was load-bearing only for the convolution, never for the scan. The companion pins the equivalence to machine precision — the selective associative scan and the sequential recurrence agree to a measured residual $\sim 10^{-15}$ across sequence lengths, which test_selective_scan_equals_sequential holds below $10^{-12}$ .

Two panels. Left: the maximum difference between the selective associative scan and the sequential recurrence versus sequence length, flat near 1e-15. Right: log-log critical-path depth, the sequential curve rising as L and the selective-scan curve as log L. — The selective scan equals the sequential recurrence (left, residual $\sim 10^{-15}$, pinned below $10^{-12}$) while keeping the parallel critical-path depth $\lceil\log_2 L\rceil$ against the sequential $L$ (right). Unlike S4, there is no FFT-convolution fallback — the scan is the only parallel route. Produced by companions/ch09/jax/selective_scan_demo.py; pinned by test_selective_scan_equals_sequential.

[Note] This is the primitive §8.6 said would “outlive time-invariance.” Chapter 8 built S5’s associative scan when a convolution was still available as an alternative; here the alternative is gone, and the same code becomes the load-bearing computation. PyTorch still has no native parallel scan, so the torch companion (selective_ssm.py) runs the recurrence as an eager loop — bit-for-bit identical states, sequential depth; in production a custom kernel supplies the parallel scan (§9.4).

9.4 The hardware-aware fused kernel

Making the scan the only route raises the stakes on computing it efficiently, and the obstacle is memory, not arithmetic. The selective scan’s state is $\statevec_t \in \R^N$ per channel: for a batch of $B$ sequences of length $L$ with $D$ channels and state size $N$ , materializing every intermediate state — which the naive backward pass needs — costs $O(B L D N)$ , an $N\times$ blow-up over the $O(B L D)$ inputs and outputs. At $B = 8$ , $L = 2048$ , $D = 1024$ , $N = 16$ that is 1.1 GB of fp32 state against 0.067 GB of output (figure below) — at Mamba’s working dimensions, the difference between fitting in memory and not.

Two panels. Left: log-log memory in gigabytes versus state dimension N, the materialized line B·L·D·N rising linearly while the fused line B·L·D stays flat. Right: the ratio materialized-over-fused versus N, a straight line equal to N. — Why the selective scan needs a fused kernel. For $B=8$, $L=2048$, $D=1024$, materializing all states costs $B L D N$ — at $N=16$ that is 1.1 GB of fp32 against 0.067 GB for the $B L D$ outputs, a $16\times$ ($=N$) gap. Mamba avoids it by keeping the expanded state in SRAM and recomputing it in the backward pass. Produced by companions/ch09/jax/selective_scan_demo.py.

Mamba’s selective-scan kernel resolves this with three hardware-aware moves Gu & Dao (2024) . It fuses discretization, scan, and output projection into a single GPU kernel, so the expanded $(B, L, D, N)$ state is formed and consumed in fast on-chip SRAM and never written to slow HBM. It recomputes that state in the backward pass instead of storing it — the activation-recomputation trade that pays compute to save memory, the same jax.checkpoint discipline the Mamba-1 reference implementation applies to its selective-scan body. And it exploits the fact, from §9.3, that there is no convolution to fall back on: the scan must be fast, so the engineering goes into the scan rather than into an alternative path. The companion does not implement a CUDA kernel — it makes the memory accounting concrete (the figure above), which is the part that explains why the kernel exists.

9.5 Mamba-2 and SSD: the semiseparable-matrix view

Section 9.2 wrote the input $\to$ output map entry by entry. Collect those entries into a matrix and a second, matmul-friendly computation appears. Stacking $y = M u + D u$ with

M_{kj} = \outputmat_k\,\Phi(k, j)\,\discB_j = \sum_{n} \outputmat_{k,n}\, \exp\!\Big(\statemat_n \sum_{i=j+1}^{k} \stepsize_i\Big)\, \stepsize_j\, \inputmat_{j,n} \quad (k \ge j),

and $M_{kj} = 0$ for $k < j$ , gives a lower-triangular $L \times L$ matrix. This is the object Mamba-2 organizes the layer around, and it has exactly the structure that makes both computations cheap.

Theorem 9.4.

(Structured state-space duality.) For a diagonal state matrix $\statemat \in \R^N$ , the selective-SSM output map $M$ is $N$ -semiseparable: every submatrix lying strictly below the diagonal has rank at most $N$ . Consequently $y = M u$ admits two computations of the same result — a linear one, the recurrent scan of §9.3 in $O(L N)$ work and $O(N)$ memory, and a quadratic one, forming $M$ and multiplying, in $O(L^2 N)$ work using dense matrix multiplies.

The rank bound is the duality’s engine (Exercise 9.6, §9.9): writing the accumulated decay as $\exp(\statemat_n(c_k - c_j))$ with $c_k = \sum_{i \le k}\stepsize_i$ factors each strictly-lower block into a sum of $N$ rank-one outer products. The two computations trade off by regime: the recurrent (scan) mode wins for long sequences and autoregressive inference; the quadratic (matmul) mode wins for short chunks, because dense matrix multiplies are what tensor cores do fastest. Mamba-2’s production algorithm chunks the sequence and uses the quadratic mode within chunks and the linear mode across them — both unlocked by the same structure. The companion confirms the two modes agree to a measured residual $\sim 10^{-13}$ (pinned below $10^{-12}$ by test_recurrent_equals_matmul) and certifies the rank bound numerically.

Two panels. Left: a heatmap of the absolute SSD matrix M, nonzero on and below the diagonal, with magnitudes that vary along each diagonal. Right: the singular values of an off-diagonal block on a log scale, with a sharp cliff after index N. — The SSD matrix $M$ for a selective system with $N=4$ modes. Left: $M$ is lower-triangular and, because the system is time-varying, its diagonals are not constant (not Toeplitz, §9.1). Right: an off-diagonal block has exactly $N$ significant singular values then a cliff to machine zero — the $N$-semiseparable rank bound made visible. Produced by companions/ch09/jax/ssd_semiseparable.py; pinned by test_ssm_matrix_is_n_semiseparable.

9.6 Duality with linear attention

The semiseparable view pays its largest dividend when the state matrix is a scalar. Mamba-2 restricts $\statemat = a\,I$ — one shared decay per head — which makes the accumulated decay a scalar, $L_{kj} = \exp\!\big(a \sum_{i=j+1}^{k}\stepsize_i\big)$ , and factors the matrix entry into a decay times an inner product:

M_{kj} = L_{kj}\,(\outputmat_k \cdot \inputmat_j)\,\stepsize_j \quad\Longleftrightarrow\quad M = L \circ (\outputmat\, \inputmat^{\top})\,\diag(\stepsize),

with $\circ$ the Hadamard product. Read the right-hand side as attention: $\outputmat$ are queries, $\inputmat$ are keys, $\outputmat\,\inputmat^\top$ is the score matrix, $u$ are values, and $L$ is a causal decay mask in place of softmax. Two things separate it from softmax attention. The mask $L_{kj} = \exp(a\sum_{i=j+1}^{k}\stepsize_i)$ is fixed by the accumulated step sizes between the two positions, not by a content–content interaction of $k$ and $j$ ; and the scores enter linearly — there is no softmax nonlinearity — which is precisely what lets the same operator also run in the linear-cost recurrent mode (§9.5). It is linear attention, not softmax attention.

Theorem 9.5.

(Selective SSM as masked linear attention.) For scalar $\statemat = a I$ , the selective SSM output equals $y = \big(L \circ (\outputmat\,\inputmat^{\top})\big)\,\diag(\stepsize)\,u + D u$ , a masked linear attention with decay mask $L_{kj} = \exp(a\sum_{i=j+1}^k \stepsize_i)$ . The mask $L$ is 1-semiseparable (rank-one strictly-lower blocks). The recurrent scan and this masked matmul are the SSM and attention faces of one operator.

This is the formal sense in which “transformers are SSMs” Dao & Gu (2024) : the linear-attention layer of Katharopoulos et al. Katharopoulos et al. (2020) and the selective scan compute the same family of contractions, distinguished only by which mode — recurrent or quadratic — is run. The companion verifies the identity as a matrix equation, with the directly-built $M$ and the masked-attention form agreeing to $\sim 10^{-14}$ (pinned below $10^{-12}$ by test_scalar_A_equals_masked_attention), and certifies that $L$ is 1-semiseparable. Chapter 11 enters this same duality from the attention side — linear attention and the delta-rule lineage are the “transformer” reading of the structure this section reached from the SSM side.

9.7 What’s next

The thread of this chapter is one structural change and its consequences. Making $\stepsize, \inputmat, \outputmat$ depend on the input turns the LTI SSM into a linear time-varying system (§9.2); time-variance breaks the convolution–recurrence duality, so no single kernel exists (§9.1) and the §8.6 associative scan — fed input-dependent transitions — becomes the parallel primitive the layer is computed with (§9.3), which is why Mamba needs a fused, recompute-in-backward kernel (§9.4). Re-reading that scan as a matrix exposes its $N$ -semiseparable structure and the recurrent/quadratic duality (§9.5), and for scalar state that matrix is masked linear attention (§9.6).

Chapter 10 keeps selectivity and changes the discretization: Mamba-3 moves to a complex state and replaces zero-order hold with the exponential-trapezoidal scheme of Chapter 4 when the selective dynamics turn stiff Lahoti et al. (2026) . That chapter is where the symplectic-integrator pilot finds its empirical anchor — the selective layer built here, run with a second-order geometric integrator. Chapter 11 takes the §9.6 duality the other way, developing linear attention and the delta-rule lineage as the attention-side reading of the same structured contractions. The selective scan and the semiseparable matrix are the two objects everything downstream reuses.

9.8 Exercises

Six problems mixing computation and theory. Short/numerical (9.1–9.3) have inline collapsible solutions; the theory exercises (9.4–9.6) have full worked solutions in §9.9.

Exercise 9.1 (computation)

Take one stable mode $\statemat = -1$ with $\inputmat = \outputmat = 1$ , and two step sizes: a “write” step $\stepsize_{\text{w}} = 1.5$ and a “hold” step $\stepsize_{\text{h}} = 0.01$ . (a) Compute the discrete transitions $\discA_{\text{w}} = e^{\stepsize_{\text{w}}\statemat}$ and $\discA_{\text{h}} = e^{\stepsize_{\text{h}}\statemat}$ . (b) A value written into the state then held for $20$ steps is multiplied by $\discA_{\text{h}}^{20}$ ; compute the retained fraction. (c) Contrast with an LTI system that must use $\stepsize = \stepsize_{\text{w}}$ for all $20$ steps. What does this say about selectivity?

Solution

(a) $\discA_{\text{w}} = e^{-1.5} = 0.2231$ (gate nearly open: most of the old state is overwritten), $\discA_{\text{h}} = e^{-0.01} = 0.9900$ (gate nearly closed: the state is held). (b) Holding for 20 steps retains $\discA_{\text{h}}^{20} = e^{-0.2} = 0.8187$ , about $82\%$ . (c) The LTI system stuck at $\stepsize_{\text{w}}$ retains $\discA_{\text{w}}^{20} = e^{-30} \approx 9\times10^{-14}$ — total forgetting. Selectivity is precisely the freedom to make $\discA_t$ near $1$ on tokens worth holding and near $0$ on tokens worth overwriting; a fixed $\stepsize$ cannot do both, which is the §9.2 figure in two numbers.

Exercise 9.2 (code)

Run companions/ch09/jax/selective_scan_demo.py and confirm the selective associative scan matches the sequential recurrence (the companion reports a residual $\sim 10^{-15}$ ). Then explain why the S4 trick of computing the same output by an FFT convolution is unavailable here, referencing Proposition (§9.1).

Solution

The associative scan and the sequential recurrence compute the identical linear recurrence by different reduction orders, so for the same $(\discA_t, \discB_t u_t)$ they agree to floating-point error (the companion measures $\sim 10^{-15}$ , pinned below $10^{-12}$ ). The FFT path is unavailable because it computes a convolution $y = K * u$ , which requires a single kernel $K$ . By Proposition (§9.1) no such kernel exists once $\discA_t$ is input-dependent: the map is not Toeplitz, so it is not a convolution, so there is nothing to transform. The scan, by contrast, only needs the per-step pairs and an associative operator, neither of which assumes time-invariance (§9.3).

Exercise 9.3 (computation)

Consider a scalar-state ( $N = 1$ ) selective system of length $4$ with shared decay $a$ , step sizes $\stepsize_1, \dots, \stepsize_4$ , scalar projections $\inputmat_j, \outputmat_k$ . Write the entry $M_{kj}$ of the SSD matrix for $k \ge j$ , and argue that any strictly-lower block (e.g. rows $\{3,4\}$ , columns $\{1,2\}$ ) has rank $1$ .

Solution

With $N=1$ , $M_{kj} = \outputmat_k\,\exp\!\big(a\sum_{i=j+1}^k\stepsize_i\big)\,\stepsize_j\,\inputmat_j$ for $k\ge j$ . Write $c_m = \sum_{i\le m}\stepsize_i$ , so the accumulated decay is $\exp(a(c_k - c_j)) = e^{a c_k}\,e^{-a c_j}$ . Then for $k\ge j$ ,

M_{kj} = \underbrace{\big(\outputmat_k\, e^{a c_k}\big)}_{\phi(k)}\;\underbrace{\big(e^{-a c_j}\,\stepsize_j\,\inputmat_j\big)}_{\psi(j)} = \phi(k)\,\psi(j).

The block on rows $\{3,4\}$ , columns $\{1,2\}$ is $\big[\phi(k)\psi(j)\big]_{k\in\{3,4\},\,j\in\{1,2\}} = \phi_{\{3,4\}}\,\psi_{\{1,2\}}^\top$ , an outer product of two vectors — rank $1$ . Every strictly-lower block factors the same way, so the matrix is $1$ -semiseparable.

Exercise 9.4 (theory) — solution in §9.9

Prove Proposition (§9.1): if the discrete transition $\discA_t$ is input-dependent, the causal linear map $u \mapsto y - Du$ is not Toeplitz, so no convolution kernel $K$ with $y = K*u + Du$ exists. (Hint: a causal linear map is a convolution iff its matrix has constant diagonals.)

Exercise 9.5 (theory) — solution in §9.9

Prove Proposition (§9.3): the operator $(a_1,b_1)\bullet(a_2,b_2) = (a_2 a_1,\, a_2 b_1 + b_2)$ is associative for arbitrary $a_1, a_2$ , and its inclusive prefix scan over $(\discA_t, \discB_t u_t)$ reproduces the LTV states $\statevec_t$ . Identify the single place the §8.6 argument used time-invariance and confirm the selective scan does not need it.

Exercise 9.6 (theory) — solution in §9.9

Prove Theorem (§9.5): for diagonal $\statemat \in \R^N$ , every strictly-lower-triangular block of the SSD matrix $M_{kj} = \sum_n \outputmat_{k,n}\exp(\statemat_n\sum_{i=j+1}^k\stepsize_i)\stepsize_j\inputmat_{j,n}$ has rank $\le N$ . Then specialize to scalar $\statemat = aI$ to derive the masked-attention form $M = L \circ (\outputmat\,\inputmat^\top)\diag(\stepsize)$ of Theorem (§9.6) and show $L$ is $1$ -semiseparable.

9.9 Full solutions to theory exercises

Solution to Exercise 9.4

A causal linear map $u \mapsto v$ on sequences of length $L$ is represented by a lower-triangular matrix $M$ with $v_k = \sum_{j\le k} M_{kj} u_j$ . It is a convolution $v = K * u$ — i.e. $v_k = \sum_{j\le k} K_{k-j} u_j$ — if and only if $M_{kj}$ depends only on the offset $k - j$ , which is exactly the statement that $M$ is Toeplitz (constant along each diagonal $k - j = \text{const}$ ).

For the selective system, the diagonal- $\statemat$ entry is

M_{kj} = \sum_n \outputmat_{k,n}\,\exp\!\Big(\statemat_n\!\sum_{i=j+1}^k \stepsize_i\Big)\,\stepsize_j\,\inputmat_{j,n}, \qquad k \ge j .

Consider the main diagonal $k = j$ (offset $0$ ): the accumulated sum is empty, so $M_{kk} = \sum_n \outputmat_{k,n}\,\stepsize_k\,\inputmat_{k,n} = \stepsize_k\,(\outputmat_k \cdot \inputmat_k)$ . Because $\stepsize_k, \outputmat_k, \inputmat_k$ are input-dependent, $M_{kk}$ varies with $k$ whenever the input does, so the main diagonal is not constant and $M$ is not Toeplitz. (More strongly, for $k > j$ the factor $\sum_{i=j+1}^k \stepsize_i$ is not a function of $k - j$ once $\stepsize$ varies, so no off-diagonal is constant either.) A non-Toeplitz causal map is not a convolution, so no kernel $K$ exists; the time-invariant case, where $\discA_t \equiv \discA$ makes $M_{kj} = \outputmat\,\discA^{\,k-j}\discB$ depend only on $k-j$ , is the sole case that is Toeplitz. $\blacksquare$

Solution to Exercise 9.5

Associativity. Write a pair $(a, b)$ as the affine map $h \mapsto a h + b$ (with $a$ acting elementwise). Composition of affine maps is associative because function composition is; concretely, with $\bullet$ as defined,

\big((a_1,b_1)\bullet(a_2,b_2)\big)\bullet(a_3,b_3) = (a_3 a_2 a_1,\; a_3 a_2 b_1 + a_3 b_2 + b_3),

(a_1,b_1)\bullet\big((a_2,b_2)\bullet(a_3,b_3)\big) = (a_3 a_2 a_1,\; a_3 a_2 b_1 + a_3 b_2 + b_3),

and the two agree. Nowhere did this computation assume $a_1 = a_2 = a_3$ — that is the single place §8.6 silently used (and did not need) time-invariance. Prefix scan. Let $e_t = (\discA_t, \discB_t u_t)$ and let $s_k = e_1 \bullet \cdots \bullet e_k$ be the inclusive scan. By induction on $k$ the second component of $s_k$ equals $\statevec_k$ : for $k=1$ it is $\discB_1 u_1 = \statevec_1$ (from $\statevec_0 = 0$ ); and $s_k = s_{k-1}\bullet e_k$ has second component $\discA_k\cdot(\text{2nd comp of }s_{k-1}) + \discB_k u_k = \discA_k \statevec_{k-1} + \discB_k u_k = \statevec_k$ . Since $\bullet$ is associative, all prefixes are computable by a balanced reduction tree in depth $O(\log L)$ Blelloch (1990) . The selective scan is thus the §8.6 scan verbatim, with time-varying $\discA_t$ in place of a constant $\discA$ — the operator’s associativity never cared. $\blacksquare$

Solution to Exercise 9.6

Semiseparability. Fix a strictly-lower block: rows $k \in \mathcal{K}$ and columns $j \in \mathcal{J}$ with $\min\mathcal{K} > \max\mathcal{J}$ , so $k > j$ throughout and the accumulated sum is nonempty. Write $c_m = \sum_{i=1}^m \stepsize_i$ , so $\sum_{i=j+1}^k \stepsize_i = c_k - c_j$ and $\exp(\statemat_n(c_k - c_j)) = e^{\statemat_n c_k}\,e^{-\statemat_n c_j}$ . Then

M_{kj} = \sum_{n=1}^N \big(\outputmat_{k,n}\, e^{\statemat_n c_k}\big)\,\big(e^{-\statemat_n c_j}\,\stepsize_j\,\inputmat_{j,n}\big) = \sum_{n=1}^N \phi_n(k)\,\psi_n(j),

with $\phi_n(k) = \outputmat_{k,n} e^{\statemat_n c_k}$ and $\psi_n(j) = e^{-\statemat_n c_j}\stepsize_j\inputmat_{j,n}$ . As a matrix the block is $\sum_{n=1}^N \phi_n\,\psi_n^\top = \Phi\Psi^\top$ with $\Phi \in \R^{\abs{\mathcal{K}}\times N}$ , $\Psi \in \R^{\abs{\mathcal{J}}\times N}$ , hence rank $\le N$ . Every strictly-lower block has this form, so $M$ is $N$ -semiseparable. (The factorization is exactly what the chunked algorithm exploits: $\phi$ is read out from the per-chunk states, $\psi$ folded into them.)

Scalar specialization. Put $\statemat = aI$ , i.e. $\statemat_n = a$ for all $n$ . The decay no longer depends on $n$ , so it pulls out of the sum:

M_{kj} = \exp\!\Big(a\!\sum_{i=j+1}^k \stepsize_i\Big)\,\stepsize_j \sum_{n=1}^N \outputmat_{k,n}\inputmat_{j,n} = L_{kj}\,(\outputmat_k \cdot \inputmat_j)\,\stepsize_j,

with $L_{kj} = \exp(a(c_k - c_j))$ . In matrix form $M = L \circ (\outputmat\,\inputmat^\top)\,\diag(\stepsize)$ , the masked-attention form of Theorem (§9.6): $\outputmat\,\inputmat^\top$ is the score matrix and $L$ the decay mask. Finally $L_{kj} = e^{a c_k}e^{-a c_j}$ for $k\ge j$ , so any strictly-lower block of $L$ is the outer product $(e^{a c_k})_{k}\,(e^{-a c_j})_{j}^\top$ — rank one. Thus $L$ is $1$ -semiseparable, and the full $M$ is $N$ -semiseparable with $N$ the state size (here the rank of the score matrix). $\blacksquare$

9.10 Companion code

Two language tracks for Chapter 9 — JAX (the reference, with the parallel selective scan and the semiseparable-matrix tools) and PyTorch (the nn.Module layers and the masked-attention dual). Julia is omitted, as in Chapter 8: the selective scan and the SSD matrix are the same algorithm in any language, and the JAX/PyTorch pair already carries the cross-framework lesson. The two tracks are pinned bit-for-bit against each other in float64.

JAX (companions/ch09/jax/):

selective_ssm.py — the SISO selective-SSM core: the selection mechanism, the per-step ZOH discretization, and the selective scan via jax.lax.associative_scan with its sequential oracle. The §9.3 equivalence lives here.
selective_vs_lti.py — emits selective-vs-lti.png (§9.2): the held-vs-forgotten single-mode contrast.
selective_scan_demo.py — emits selective-scan.png (§9.3) and memory-cost.png (§9.4): scan-vs-sequential agreement and depth, and the materialized-vs-fused memory accounting.
ssd_semiseparable.py — the SSD tools: segsum, the semiseparable matrix build_ssm_matrix, the rank certificate is_n_semiseparable, and the scalar- $\statemat$ masked_attention_form; emits semiseparable.png (§9.5).
tests/ — test_selective.py pins the selective scan-equivalence, stability by construction, and the not-Toeplitz collapse; test_ssd.py pins the recurrent-vs-matmul duality, $N$ -semiseparability, and the scalar- $\statemat$ masked-attention identity.

PyTorch (companions/ch09/torch/):

selective_ssm.py — the selective SSM as an nn.Module (A_log a learnable nn.Parameter, selection projections as nn.Linear, eager sequential scan), plus the functional core mirrored from JAX.
ssd_matmul.py — the scalar- $\statemat$ dual form as SSDAttention: masked linear attention with the decay mask in place of softmax.
tests/test_selective_torch.py — cross-framework parity: the torch selective scan and SSD matrices equal their JAX counterparts to within $10^{-9}$ .

To run from the repo root:

# JAX (uses the uv .venv with jax, numpy, matplotlib)
PYTHONPATH=. python companions/ch09/jax/selective_vs_lti.py
PYTHONPATH=. python companions/ch09/jax/selective_scan_demo.py
PYTHONPATH=. python companions/ch09/jax/ssd_semiseparable.py

# PyTorch (needs the .venv [torch] extra)
PYTHONPATH=. python companions/ch09/torch/selective_ssm.py
PYTHONPATH=. python companions/ch09/torch/ssd_matmul.py

All figures emit to public/figures/ch09/.

Part II · SSM Core Week 10 Published

Mamba-3 and the exponential-trapezoidal integrator

How Mamba-3 keeps selectivity and changes two things — a second-order exponential-trapezoidal integrator and a complex state — turning the dynamical-systems lens of Chapters 4-6 onto a production sequence model.

Mamba-3 and the exponential-trapezoidal integrator

Chapter 10 — at a glance

Goal: see how Mamba-3 keeps the selectivity of Chapter 9 and changes exactly two things of dynamical-systems substance — the integrator (zero-order hold becomes a second-order exponential-trapezoidal scheme; §10.2’s trapezoidal-quadrature member of the Chapter 4 family) and the state (real diagonal becomes complex, so one head can both decay and oscillate). This is where the numerical-analysis machinery of Chapters 4-6 meets a production architecture.

Reading time: ~45 minutes prose; 2 hours with the companions.

Direct-transfer hook (C1 pilot): Mamba-3 is the symplectic-integrator pilot’s empirical anchor. The pilot’s question is exactly the one this chapter sets up — does a higher-order, structure-aware integrator help a selective SSM, and at what stability cost? Exp-trapezoidal is the first step (a second-order exponential integrator); §10.3 marks where the geometric integrators of Chapter 6 take over as the open research direction (revisited in Chapter 17).

Key insight: for an exponential integrator the state transition $\discA = e^{\statemat \stepsize}$ is computed exactly, so stability is decided by $\statemat$ alone and accuracy is decided by the input quadrature alone — the two concerns that are entangled for Runge-Kutta methods here come apart. The order upgrade lives entirely in how the forcing is approximated.

10.1 Two upgrades to the selective SSM

Chapter 9 built the selective SSM: input-dependent $(\stepsize_t, \inputmat_t, \outputmat_t)$ make the dynamics linear time-varying, the convolution view collapses, and the selective scan plus the semiseparable-matrix view (SSD) carry the computation. Mamba-1 and Mamba-2 both discretize with zero-order hold (ZOH) and keep a real diagonal state matrix $\statemat$ .

Mamba-3 Lahoti et al. (2026) changes two things of dynamical-systems substance and leaves the rest of the SSD scaffold (the semiseparable structure of Theorem 9.4, the scan, the attention duality of Theorem 9.5) intact:

A second-order integrator. ZOH is first-order accurate; Mamba-3 adopts a second-order exponential-trapezoidal scheme — the exponential-integrator idea of Chapter 4 §4.5 in its trapezoidal-quadrature form (§10.2) — while keeping ZOH’s exact treatment of the homogeneous dynamics (§10.2-10.3).
A complex state. A real diagonal mode $\statemat < 0$ can only decay; a complex mode $\statemat = \log\rho + i\theta$ decays and rotates, so a single head becomes a damped oscillator (§10.4).

Both changes are dynamical-systems moves, and both have been waiting since the foundations chapters: the integrator is the Chapter 4-5 story applied to a real architecture, and the complex state is the spiral phase portrait of Chapters 1-2. The C1 pilot’s empirical question — whether a second-order, structure-preserving integrator actually buys accuracy on a selective SSM without costing stability — is the through-line.

10.2 The exponential-trapezoidal scheme

Recall the continuous mode $\dot{\statevec} = \statemat \statevec + u(t)$ , whose exact one-step map is $\statevec_k = e^{\statemat\stepsize}\statevec_{k-1} + \int_0^{\stepsize} e^{\statemat(\stepsize - s)}\, u(t_{k-1}+s)\,ds$ . Every exponential integrator keeps the transition $e^{\statemat\stepsize}$ exact (the philosophy of Chapter 4 §4.5) and approximates only the forcing integral. ZOH holds the input constant at the left endpoint; the exponential-trapezoidal scheme applies the trapezoidal quadrature rule to the forcing integral — averaging the integrand at the two endpoints. The integrand is $e^{\statemat\stepsize} u_{k-1}$ at $s = 0$ and $u_k$ at $s = \stepsize$ , so the rule gives $\tfrac{\stepsize}{2}(e^{\statemat\stepsize} u_{k-1} + u_k)$ . Weighting the two endpoints by $\lambda \in [0,1]$ gives the three-term update

\statevec_k = \alpha\, \statevec_{k-1} + \beta\, u_{k-1} + \gamma\, u_k, \qquad \alpha = e^{\statemat \stepsize},\; \beta = (1-\lambda)\,\stepsize\,\alpha,\; \gamma = \lambda\,\stepsize .

The contrast with ZOH is the third term: where ZOH reads only the left endpoint $u_{k-1}$ , the trapezoid reads both endpoints. At $\lambda = \tfrac12$ this is the symmetric trapezoidal rule (globally second-order); at $\lambda = 1$ it collapses to a shifted ZOH (first order). Mamba-3 makes $\lambda$ itself input-dependent (a sigmoid of a learned projection), one more selective knob.

This single fact — that $\alpha$ is identical for the two schemes — has a sharp consequence for how the order can be measured.

Theorem 10.1 (Second order lives in the forcing, and is invisible without it).

For $\lambda = \tfrac12$ the exponential-trapezoidal scheme has local truncation error $O(\stepsize^3)$ per step, hence global order 2, provided the input $u$ is non-constant and $C^2$ . On the homogeneous system ( $u \equiv 0$ ) both ZOH and exp-trapezoidal reproduce the exact solution $\statevec_k = e^{\statemat t_k} \statevec_0$ to machine precision, so the order difference is unobservable there.

The proof is Exercise 10.4 (§10.9). The practical reading is a trap worth naming: because the homogeneous transition is exact, a convergence study run on an autonomous system shows ZOH and exp-trapezoidal as identical lines at roundoff — the second order is simply not visible. The companion confirms this directly: on the autonomous mode both schemes sit at $\sim 2 \times 10^{-15}$ , while on the forced mode the fitted convergence slopes separate cleanly into $1.00$ (ZOH) and $2.00$ (exp-trapezoidal). This is the same lesson Chapter 4 flagged in Exercise 4.3 (“setting $u \equiv 0$ makes ZOH exact; the slope is meaningless”), now load-bearing for verifying a real architecture’s headline claim.

Log-log plot of maximum error versus step size on a forced complex mode, ZOH at slope 1 and exponential-trapezoidal at slope 2. — Global error versus step size on the forced mode $\dot{x} = A x + \sin(\omega t)$ with $A = -0.5 + 2i$. ZOH (navy) fits slope 1.00; exponential-trapezoidal (rose) fits slope 2.00; dotted guides mark the reference slopes. The order separation appears only because the input is non-constant — on the homogeneous system both lines collapse to roundoff. Produced by companions/ch10/jax/discretization.py; the slopes are pinned in tests/test_discretization.py.

10.3 Stability and accuracy decouple

Why adopt an exponential integrator rather than a higher-order Runge-Kutta method? Because the exponential transition is exact, the discrete amplification factor is $|\alpha| = e^{\operatorname{Re}(\statemat)\stepsize}$ , which is $< 1$ for any stable mode ( $\operatorname{Re}(\statemat) < 0$ ) at any step size. ZOH and exp-trapezoidal are therefore unconditionally stable on the homogeneous part — A-stable over the entire left half-plane — and, crucially, stability is set by $\statemat$ alone, independent of the order of the forcing quadrature. For a Runge-Kutta method a single stability polynomial controls both accuracy and the stability region; the exponential integrator decouples them.

Theorem 10.2 (Exponential schemes are A-stable and damp stiff modes).

For $\operatorname{Re}(\statemat) < 0$ , the exponential-trapezoidal (and ZOH) amplification factor satisfies $|\alpha| = e^{\operatorname{Re}(\statemat) \stepsize} < 1$ for every $\stepsize > 0$ , and $|\alpha| \to 0$ as the mode stiffens ( $\operatorname{Re}(\statemat)\stepsize \to -\infty$ ). The bilinear (Tustin) scheme is also A-stable but uses the $(1,1)$ -Padé approximation $\alpha = (1 + \statemat\stepsize/2)/(1 - \statemat\stepsize/2)$ , for which $|\alpha| \to 1$ as the mode stiffens: stiff modes are left undamped.

The proof is Exercise 10.6 (§10.9). The companion makes the contrast quantitative: at a stiff $\statemat\stepsize = -30$ , the exponential schemes give $|\alpha| \approx 9.4 \times 10^{-14}$ (fully damped) while bilinear gives $|\alpha| = 0.875$ (still ringing). This is why an exponential integrator is the right tool when the input-dependent $\statemat(u_t)$ of a selective SSM produces stiff dynamics, as Chapter 4 §4.5 anticipated for exactly this chapter. One caveat matters for §10.4: the theorem bounds the magnitude $|\alpha|$ only. For a complex mode it says nothing about how faithfully the discrete rotation tracks the continuous phase $\theta\stepsize$ — phase fidelity is a separate question, and the open one the frontier paragraph below raises.

Two panels: left, A-stability regions in the complex plane; right, amplification magnitude versus stiffness for exponential versus bilinear schemes. — Left: the exponential schemes are A-stable on the whole left half-plane ($\operatorname{Re}(A\Delta) \le 0$), whereas forward Euler is stable only inside its unit disk. Right: along the negative real axis the exponential amplification $e^z$ decays to zero (stiff modes damped, navy) while the bilinear Padé factor tends to one (stiff modes undamped, rose). Produced by companions/ch10/jax/discretization.py; the stiff-mode contrast is pinned in tests/test_discretization.py.

The geometric-integrator frontier (where the pilot goes next). Exp-trapezoidal is a second-order exponential integrator; it is not symplectic. For the oscillatory complex modes introduced in §10.4, the question the symplectic-integrator pilot asks is whether a geometric integrator (Chapter 6) — one that preserves a phase-space invariant rather than merely a damping bound — controls long-horizon phase error better than exp-trapezoidal does. That is an open empirical question, not a settled result, and it is the heart of the “discretization stability atlas” the pilot pursues; Chapter 17 returns to it. This chapter establishes the settled part: exp-trapezoidal is second-order, A-stable, and exact on the homogeneous part.

10.4 Complex state: decay and oscillation in one head

A real diagonal mode $\statemat < 0$ gives $\alpha = e^{\statemat\stepsize} \in (0,1)$ : pure geometric decay, a single forgetting timescale. Mamba-3 lets the mode be complex, $\statemat = \log\rho + i\theta$ , so

\alpha = e^{\statemat\stepsize} = \underbrace{\rho^{\stepsize}}_{\text{decay}}\; \underbrace{e^{i\theta\stepsize}}_{\text{rotation}} .

The magnitude decays at rate $\log\rho$ per unit step and the phase advances by $\theta\stepsize$ : a single head is now a damped oscillator, tracing the logarithmic spiral of Chapters 1-2 rather than a monotone decay. This lets one head represent an oscillatory feature — a frequency — instead of only a timescale.

In production Mamba-3 never stores a complex dtype. It keeps a real state and applies rotary position embeddings (RoPE) — 2-D rotations — to the projections $\inputmat, \outputmat$ . The justification is the elementary isomorphism between complex multiplication and $2\times 2$ rotation-scaling.

Theorem 10.3 (RoPE rotation equals complex multiplication).

Representing $z = a + bi$ as the real vector $[a, b]^\top$ , multiplication by $\rho e^{i\theta}$ equals applying the scaled rotation $\rho R(\theta) = \rho\left(\begin{smallmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{smallmatrix}\right)$ . Consequently the complex recurrence $x_k = (\rho e^{i\theta}) x_{k-1} + b_k$ and the real 2-D recurrence $s_k = \rho R(\theta) s_{k-1} + [\operatorname{Re} b_k, \operatorname{Im} b_k]^\top$ produce identical trajectories.

The proof is Exercise 10.5 (§10.9). The companion pins the two recurrences equal to $\sim 3 \times 10^{-15}$ under a random complex drive, and confirms the spiral’s decay rate equals $\log\rho$ exactly — the empirical content of “decay and oscillation in one head.”

Left, the spiral trajectory of a complex mode in the real-imaginary plane; right, its magnitude decaying geometrically on a log scale. — A single complex mode $\rho = 0.95$, $\theta = 20^\circ$ per step. Left: the state spirals inward in the $(\operatorname{Re}, \operatorname{Im})$ plane — simultaneous decay and rotation. Right: the magnitude follows the $\rho^k$ envelope exactly (measured per-step decay rate $-0.0513 = \log\rho$). Produced by companions/ch10/jax/complex_state.py; the decay rate and RoPE equivalence are pinned in tests/test_complex_state.py.

10.5 The trapezoidal SSD pass

The second-order stencil must coexist with the SSD machinery of Chapter 9 — the semiseparable matrix that lets the layer run as a parallel matmul or a linear-time scan. It does, because only the injection stencil changes. Unrolling the trapezoidal recurrence with the state-transition operator $\Phi(k,j) = \prod_{i=j+1}^{k} \alpha_i = \exp(\statemat \sum_{i=j+1}^{k}\stepsize_i)$ (the same $\Phi$ as Definition 9.2) splits the output into two semiseparable streams that share the one decay $\Phi$ : a right-endpoint stream weighting source $j$ by $\lambda\stepsize_j$ , and a left-endpoint stream weighting source $j$ by $(1-\lambda)\stepsize_{j+1}$ across the strictly-lower triangle. Both reuse the Chapter 9 segment-sum decay; the order-2 upgrade is two cheap masked matmuls in place of one.

The companion verifies that the dense two-stream matmul reproduces the sequential trapezoidal recurrence to $\sim 2 \times 10^{-14}$ (for both real and complex modes), and that the end-to-end output is genuinely second-order — halving the step size cuts the error by a factor of four. The SSD duality of §9.5 survives the integrator upgrade intact: same matrix, two schedules, now order 2.

10.6 MIMO and the production block

Two further Mamba-3 changes sit outside the integrator-and-state story this chapter develops, and we treat them briefly. MIMO rank- $R$ state mixing (the input and output projections each gain a rank- $R$ dimension) raises decode-time arithmetic intensity at fixed state size and, Mamba-3 reports, also lifts model quality through the added expressivity (a $\sim 1.2$ -point average downstream gain over the SISO variant) — but neither effect changes the discretization or stability story. The full production block — SwiGLU gating, QK-normalization, the $\inputmat\outputmat$ bias — wraps the layer studied here; those too are orthogonal to the integrator-and-state question, and we point to the reference implementation rather than reproduce them Lahoti et al. (2026) .

What this chapter isolates is the dynamical-systems substance: a selective SSM (Chapter 9) run with a second-order exponential integrator (Chapters 4-5) and a complex state (Chapters 1-2, 8). The open program — whether geometric integrators (Chapter 6) preserve long-horizon phase structure better than the exponential scheme over the learned eigenvalue spectrum — is the discretization stability atlas the C1 pilot builds, and Chapter 17 returns to it.

10.7 What’s next

Mamba-3 kept Chapter 9’s selectivity and changed two things: a second-order exponential-trapezoidal integrator (accuracy in the forcing, stability from the exact exponential) and a complex state (decay and oscillation in one head), realized via RoPE and computed through the trapezoidal SSD pass. This closes the core SSM line — HiPPO (Chapter 7) to LTI S4/S4D/S5 (Chapter 8) to selective Mamba (Chapter 9) to Mamba-3 (here).

Chapter 11 opens the “beyond SSM” part by taking the SSD duality of §9.6 the other way: linear attention and the Hyena long-convolution lineage enter from the attention side of Theorem 9.5, where the structured contractions of these four chapters reappear wearing transformer clothing.

10.8 Exercises

Six problems: three short (inline solutions), three longer (full solutions in §10.9).

Exercise 10.1 (short)

For a complex mode $\statemat = \log\rho + i\theta$ with $\rho = 0.9$ , $\theta = \pi/6$ , and step $\stepsize = 1$ , compute $|\alpha|$ and the phase advance of $\alpha = e^{\statemat\stepsize}$ over two steps. Which is the decay timescale and which is the oscillation frequency?

Solution

$\alpha = \rho\, e^{i\theta} = 0.9\, e^{i\pi/6}$ , so $|\alpha| = 0.9$ per step and the phase advances $\theta = \pi/6 = 30^\circ$ per step. Over two steps $|\alpha^2| = 0.81$ and the phase advances $60^\circ$ . The magnitude $\rho = 0.9$ sets the decay timescale; $\theta = \pi/6$ sets the oscillation frequency. They are independent — that is the point of a complex mode.

Exercise 10.2 (short, code)

Run companions/ch10/jax/discretization.py. Confirm the fitted convergence slopes are $\approx 1$ for ZOH and $\approx 2$ for exp-trapezoidal on the forced system. Then explain, in one sentence, why the same script reports both schemes at $\sim 10^{-15}$ on the homogeneous system.

Solution

The script prints ZOH slope $1.00$ and exp-trapezoidal slope $2.00$ on the forced mode. On the homogeneous mode the transition $\alpha = e^{\statemat\stepsize}$ is exact for both schemes, so both reproduce $\statevec_0 e^{\statemat t_k}$ to floating-point roundoff and the order difference — which lives entirely in the forcing coefficients $\beta, \gamma$ — never enters (Theorem 10.1).

Exercise 10.3 (short)

Write the exp-trapezoidal coefficients $(\alpha, \beta, \gamma)$ for $\lambda = 0$ , $\lambda = \tfrac12$ , and $\lambda = 1$ . Show that $\lambda = 1$ reduces to a first-order shifted ZOH (no current-input term carrying the decay).

Solution

$\alpha = e^{\statemat\stepsize}$ throughout. $\lambda = 0$ : $\beta = \stepsize\,\alpha$ , $\gamma = 0$ (pure left endpoint). $\lambda = \tfrac12$ : $\beta = \tfrac{\stepsize}{2}\alpha$ , $\gamma = \tfrac{\stepsize}{2}$ (symmetric). $\lambda = 1$ : $\beta = 0$ , $\gamma = \stepsize$ — only the current input $u_k$ enters, with no decay applied to it, which is a first-order ZOH shifted by one step. The symmetric $\lambda = \tfrac12$ is the only second-order member.

Exercise 10.4 (theory) — solution in §10.9

Prove Theorem 10.1: for $\lambda = \tfrac12$ the exponential-trapezoidal scheme has local truncation error $O(\stepsize^3)$ on a forced system, and show that on the homogeneous system both ZOH and exp-trapezoidal are exact.

Exercise 10.5 (theory) — solution in §10.9

Prove Theorem 10.3: that a 2-D rotation by $\theta$ on $[\operatorname{Re} z, \operatorname{Im} z]$ equals multiplication of $z$ by $e^{i\theta}$ , and conclude the complex and real-RoPE recurrences coincide.

Exercise 10.6 (theory) — solution in §10.9

Prove Theorem 10.2: that the exponential schemes are A-stable with $|\alpha| \to 0$ on stiff modes, and contrast the bilinear Padé factor, showing $|\alpha| \to 1$ as $\statemat\stepsize \to -\infty$ .

10.9 Full solutions to theory exercises

Solution to Exercise 10.4

Write the exact one-step map by variation of constants over $[t_{k-1}, t_k]$ :

\statevec(t_k) = e^{\statemat\stepsize}\statevec(t_{k-1}) + \int_0^{\stepsize} e^{\statemat(\stepsize - s)}\, u(t_{k-1}+s)\,ds .

The transition $e^{\statemat\stepsize}$ is reproduced exactly by both schemes, so all error is in the forcing integral. Exp-trapezoidal approximates $\int_0^{\stepsize} g(s)\,ds$ , with integrand $g(s) = e^{\statemat(\stepsize-s)}\,u(t_{k-1}+s)$ , by the trapezoidal quadrature rule $\tfrac{\stepsize}{2}\big(g(0) + g(\stepsize)\big)$ — the $\lambda = \tfrac12$ endpoint weights of §10.2. The trapezoidal rule on an interval of width $\stepsize$ has error $-\tfrac{\stepsize^3}{12}\,g''(\xi)$ for some interior $\xi$ ; since $g$ is $C^2$ whenever $u$ is $C^2$ , the local truncation error is $O(\stepsize^3)$ per step, and summing $O(\stepsize^{-1})$ steps over a fixed horizon yields global order 2. For the homogeneous system $u \equiv 0$ the forcing integral vanishes identically and $\statevec_k = e^{\statemat t_k}\statevec_0$ exactly for both schemes, regardless of order — there is no forcing term to approximate, so the order distinction disappears. $\blacksquare$

Solution to Exercise 10.5

Let $z = a + bi$ and write $[z] = [a, b]^\top$ . Then

e^{i\theta} z = (\cos\theta + i\sin\theta)(a + bi) = (a\cos\theta - b\sin\theta) + i(a\sin\theta + b\cos\theta),

whose real-vector form is $[a\cos\theta - b\sin\theta,\; a\sin\theta + b\cos\theta]^\top = R(\theta)[z]$ with $R(\theta) = \left(\begin{smallmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{smallmatrix}\right)$ . Scaling by $\rho$ commutes with the real isomorphism, so multiplication by $\rho e^{i\theta}$ is $\rho R(\theta)$ on $[z]$ . Applying this at every step of the affine recurrence (the additive drive $b_k$ maps to $[b_k]$ componentwise, which the isomorphism also preserves) shows the complex recurrence $x_k = (\rho e^{i\theta})x_{k-1} + b_k$ and the real recurrence $s_k = \rho R(\theta) s_{k-1} + [b_k]$ produce identical trajectories under $[\cdot]$ . $\blacksquare$

Solution to Exercise 10.6

For the exponential schemes $\alpha = e^{\statemat\stepsize}$ , so $|\alpha| = e^{\operatorname{Re}(\statemat)\stepsize}$ . If $\operatorname{Re}(\statemat) < 0$ and $\stepsize > 0$ then the exponent is negative and $|\alpha| < 1$ for every $\stepsize$ (A-stability over the whole left half-plane), and as the mode stiffens ( $\operatorname{Re}(\statemat)\stepsize \to -\infty$ ) we have $|\alpha| \to 0$ . For the bilinear scheme $\alpha = (1 + z/2)/(1 - z/2)$ with $z = \statemat\stepsize$ ; writing $z = x + iy$ with $x < 0$ , $|1 + z/2| < |1 - z/2|$ (the left point is closer to $-2$ , the right to $+2$ ), so $|\alpha| < 1$ — also A-stable. But as $x \to -\infty$ the ratio $(1 + z/2)/(1 - z/2) \to -1$ , so $|\alpha| \to 1$ : stiff modes are mapped to the unit circle and ring rather than decaying. This is the qualitative gap the exponential transition closes. $\blacksquare$

10.10 Companion code

The companions live in companions/ch10/{jax,julia,torch}/. The JAX module covers the integrator, the complex state, and the trapezoidal SSD pass; the PyTorch module mirrors the integrator and complex state (cross-framework parity); the Julia module is the discretization atlas only — the numerical-analysis core — and joins the Chapter 4-6 Julia family that the C1 pilot’s stability atlas builds on.

The discretization study (order slopes, homogeneous-blindness, stability regions):

PYTHONPATH=. python companions/ch10/jax/discretization.py
julia --project=companions/ch10/julia companions/ch10/julia/discretization.jl

The complex state and RoPE equivalence (the spiral figure):

PYTHONPATH=. python companions/ch10/jax/complex_state.py

The trapezoidal SSD pass (matmul equals the sequential oracle):

PYTHONPATH=. python companions/ch10/jax/trapezoidal_ssd.py

The PyTorch mirror (integrator and complex state, parity-checked against JAX):

PYTHONPATH=. python companions/ch10/torch/discretization.py
PYTHONPATH=. python companions/ch10/torch/complex_state.py

Run the suites — JAX and Julia in the fast loop, torch separately:

make companion-jax-tests
make companion-julia-tests
make companion-torch-tests

Part III · Beyond SSMs Week 11 Published

Linear attention and Hyena: long convolution and gated recurrence

Linear attention read as a matrix-state linear recurrence — the dual of Chapter 9's scan, entered from the attention side — then gated into the LTV/selectivity face (GLA, RetNet), pivoted to Hyena's implicit LTI long convolution, and landed honestly on the associative-recall capacity limit of any finite state.

Linear attention and Hyena: long convolution and gated recurrence

Chapter 11 — at a glance

Goal: read linear attention as a matrix-valued-state linear recurrence — the exact dual of Chapter 9’s scan, approached from the attention side — then add gating (GLA, RetNet) as the selectivity/LTV move read from that side, pivot to Hyena’s implicit long convolution (an LTI operator in the sense of Chapter 8), and land honestly on where linear attention fails: multi-query associative recall.

Reading time: ~45 minutes prose; ~2 hours with the companions.

Key insight: softmax attention has no fixed-size state; linear attention does. Replacing the score $\exp(q^\top k)$ with a separable surrogate $\phi(q)^\top \phi(k)$ collapses the $O(\seqlen^2)$ attention into an $O(\seqlen)$ recurrence on a matrix state $S_t = \sum_{j \le t} \phi(k_j) v_j^\top$ . That state is finite-rank memory — the same boundedness Chapters 7 and 9 live inside — which is exactly why associative recall eventually breaks (§11.6).

Looking ahead: this is the gateway chapter of the beyond-SSM part. The linear-attention + gating vocabulary built here is the prerequisite for Chapter 12’s delta-rule lineage (the overwrite answer to §11.6’s capacity limit) and for the hybrid and benchmark chapters (14, 16). Ch 11 has no pilot anchor of its own — it is the connective tissue that makes the “beyond-SSM” family legible as one structure.

11.1 From softmax to separable scores

Chapter 9 reached a striking identity from the state-space side: a scalar-state selective SSM is masked linear attention (Theorem 9.5). The scan’s path-ordered taps reassemble into a single lower-triangular matrix $M = L \circ (\outputmat \inputmat^\top)\,\mathrm{diag}(\stepsize)$ , with $L$ a causal decay mask replacing the softmax. This chapter enters that same duality from the other side. We start with attention and rediscover the state space.

Causal attention computes, for each position $t$ ,

y_t = \frac{\sum_{j \le t} \mathrm{sim}(q_t, k_j)\, v_j}{\sum_{j \le t} \mathrm{sim}(q_t, k_j)}, \qquad \mathrm{sim}(q, k) = \exp(q^\top k).

The exponential couples every query to every key: the normalizer cannot be factored across $j$ , so the only way to evaluate it is to materialize all $O(\seqlen^2)$ scores. The move that breaks the coupling Katharopoulos et al. (2020) is to choose a feature map $\phi : \mathbb{R}^{d} \to \mathbb{R}^{d_k}$ with non-negative entries and write the similarity as an inner product of features,

\mathrm{sim}(q, k) = \phi(q)^\top \phi(k) \;\ge\; 0 .

Separability is the whole engine. Because $\phi(q_t)^\top \phi(k_j)$ factors, the numerator and denominator each become a query contracted against a running sum over keys — and a running sum is a state.

11.2 Linear attention as a matrix-state recurrence

Fix $\phi(x) = \mathrm{elu}(x) + 1 > 0$ . Define a matrix state $S_t \in \mathbb{R}^{d_k \times d_v}$ and a normalizer $z_t \in \mathbb{R}^{d_k}$ by

S_t = S_{t-1} + \phi(k_t)\,v_t^\top, \qquad z_t = z_{t-1} + \phi(k_t), \qquad y_t = \frac{S_t^\top \phi(q_t)}{z_t^\top \phi(q_t)},

with $S_0 = 0$ , $z_0 = 0$ . This is a bona fide state-space recurrence: the hidden state is a matrix, the transition is the identity ( $A = I$ , pure accumulation with no forgetting — the LTI face of Chapter 8), and the input is injected as the rank-one outer product $\phi(k_t) v_t^\top$ . It is Chapter 9’s scan with the carry promoted from a vector to a matrix and the decay switched off.

The same operator has a second, parallel face. Stacking $Q_\phi = [\phi(q_1); \dots; \phi(q_\seqlen)]$ and $K_\phi = [\phi(k_1); \dots; \phi(k_\seqlen)]$ , the unnormalized output is a masked matrix product.

Theorem 11.1 (Linear attention: two faces of one operator).

For any feature map $\phi$ , the recurrent form ( $S_t = S_{t-1} + \phi(k_t)v_t^\top$ , output $S_t^\top \phi(q_t)$ ) and the parallel form

Y = \big(L \circ (Q_\phi K_\phi^\top)\big)\, V, \qquad L_{tj} = \mathbb{1}[\,j \le t\,],

with $L$ the all-ones causal mask, compute the identical output. The recurrent form costs $O(\seqlen\, d_k\, d_v)$ time and $O(d_k\, d_v)$ memory; the parallel form costs $O(\seqlen^2 d)$ . This is exactly Theorem 9.5 with the decay mask specialized to ones.

The proof (Exercise 11.4, §11.9) is one line of reassociation: $S_t = \sum_{j \le t} \phi(k_j) v_j^\top$ , so $S_t^\top \phi(q_t) = \sum_{j \le t} (\phi(q_t)^\top \phi(k_j))\, v_j$ , which is precisely row $t$ of $(L \circ (Q_\phi K_\phi^\top))V$ . The mask $L$ supplies the causal cumulative sum that the recurrence performs incrementally. The companion confirms the two routes agree to machine precision in float64: $\max|y_{\mathrm{rec}} - y_{\mathrm{par}}| = 5.6 \times 10^{-16}$ at $\seqlen = 64$ , flat across sequence lengths.

Two panels: left, the maximum difference between recurrent and parallel linear attention versus sequence length, flat near machine epsilon; right, a log-log plot of recurrent O(L) versus parallel O(L-squared) cost. — The two faces of linear attention. Left: the recurrent (scan) and parallel (masked-matmul) forms agree to $\sim 10^{-16}$ in float64 across sequence length — they are the same operator. Right: their costs diverge — recurrent $O(\seqlen)$ versus parallel $O(\seqlen^2)$ — which is why the recurrent form is the inference mode and the parallel form the training mode. Produced by companions/ch11/jax/linear_attention.py; the identity is pinned in tests/test_linear_attention.py::test_recurrent_equals_parallel_normalized.

The normalizer, and why this chapter is float64. The denominator $z_t^\top \phi(q_t)$ is a large-over-large quotient: both $z_t = \sum_{j\le t} \phi(k_j)$ and the numerator grow like $O(t)$ , so the ratio is a difference of accumulations prone to catastrophic cancellation in low precision. The recurrent and parallel forms accumulate that quotient in different orders (sequential cumsum versus a single matmul), so their agreement is a direct probe of arithmetic precision. In float64 they match to $2 \times 10^{-16}$ at $\seqlen = 512$ ; in float32 the same computation caps out at $1.2 \times 10^{-7}$ — a precision gap of nearly $10^{9}$ . The $10^{-12}$ identity pin is simply unreachable in single precision, which is why every companion in this chapter runs float64. (Gated linear attention in §11.3 sidesteps the issue by dropping the normalizer for an explicit decay plus an output normalization — cleaner numerically.)

11.3 Gating: GLA, RetNet, and the LTV reading

Ungated linear attention never forgets. To make it selective — to let the state forget — attach a per-step gate $\gamma_t \in (0,1)^{d_k}$ to the transition:

S_t = \mathrm{diag}(\gamma_t)\, S_{t-1} + \phi(k_t)\, v_t^\top, \qquad y_t = S_t^\top \phi(q_t).

This is the Chapter 9 selectivity move read from the attention side. Two regimes sit inside it:

RetNet Sun et al. (2023) takes $\gamma_t \equiv \gamma$ , a fixed head-wise decay. The transition is constant, so the operator is LTI with forgetting — the Chapter 8 face plus a decay. Its parallel decay mask $L_{tj} = \gamma^{\,t-j}$ is Toeplitz (it depends only on the lag $t - j$ ).
GLA Yang et al. (2024) and HGRN2 Qin et al. (2024) make $\gamma_t = \sigma(W x_t)$ data-dependent. The transition now changes every step, so the operator is LTV — input-dependent forgetting, exactly the selectivity of Chapter 9. Its decay mask $L_{tj} = \prod_{i=j+1}^{t} \gamma_i$ is not Toeplitz; the diagonals vary, just as Chapter 9’s selective decay mask does.

The masked-parallel form generalizes by the GLA “secondary” rescaling: with the cumulative log-gate $g_t = \sum_{i \le t} \log \gamma_i$ , set $\tilde q_t = \phi(q_t) \odot e^{g_t}$ and $\tilde k_j = \phi(k_j) \odot e^{-g_j}$ ; then $\tilde q_t^\top \tilde k_j = \phi(q_t)^\top \mathrm{diag}\!\big(\prod_{i=j+1}^t \gamma_i\big) \phi(k_j)$ is the gated score. This is what makes the duality exact.

Theorem 11.2 (Gated linear attention is masked linear attention with a semiseparable decay mask).

The gated recurrence $S_t = \mathrm{diag}(\gamma_t) S_{t-1} + \phi(k_t)v_t^\top$ , $y_t = S_t^\top \phi(q_t)$ , has a masked-parallel form. For a scalar per-step gate $\gamma_i$ it is exactly $Y = (L_\gamma \circ (Q_\phi K_\phi^\top))\,V$ with the 1-semiseparable decay mask $L_{\gamma,\,tj} = \prod_{i=j+1}^{t} \gamma_i$ , and under $\log \gamma_i = \statemat_i\, \stepsize_i$ this $L_\gamma$ coincides exactly with the decay mask of Theorem 9.5 — gated linear attention and the scalar-state selective SSM are the same operator, read from opposite sides. For a per-feature gate $\gamma_t \in (0,1)^{d_k}$ the decay acts per channel and does not collapse to a single scalar mask; the parallel form instead rescales features by the cumulative gate ( $\tilde q_t = \phi(q_t)\odot e^{g_t}$ , $\tilde k_j = \phi(k_j)\odot e^{-g_j}$ , $g_t = \sum_{i\le t}\log\gamma_i$ ), recovering the same recurrence (§11.9).

The proof is Exercise 11.5 (§11.9): unroll the gated recurrence, contract with $\phi(q_t)$ , and recognize the product-of-gates as the cumulative decay that Theorem 9.4‘s segsum exponentiates. The companion both checks the recurrent–masked identity ( $\max$ difference $2.8 \times 10^{-14}$ for a data-dependent gate) and verifies the cross-chapter bridge against the actual Chapter 9 code: the GLA scalar mask and Chapter 9’s segsum mask agree to $0.0$ (bit-identical) under $\log \gamma_i = \statemat_i \stepsize_i$ .

Two heatmaps of lower-triangular decay masks. Left, RetNet's constant-gamma mask is banded and Toeplitz. Right, GLA's input-dependent mask is lower-triangular but its diagonals visibly vary, so it is not Toeplitz. — The decay mask is where selectivity becomes visible. Left: RetNet's constant gate gives $L_{tj} = \gamma^{t-j}$, a Toeplitz (lag-only) mask — the LTI face. Right: GLA's input-dependent gate gives $L_{tj} = \prod_{i=j+1}^{t}\gamma_i$, whose diagonals vary along the sequence — the LTV face, the same non-Toeplitz structure Chapter 9's selective decay mask shows. Produced by companions/ch11/jax/gated_linear_attention.py; the recurrent–masked identity and the Chapter 9 bridge are pinned in tests/test_gated_linear_attention.py.

11.4 Hyena: the FFT long convolution

The second operator family of this chapter abandons attention entirely — yet it is the same structural idea in new dress, not a digression: Hyena’s convolution is one more instance of the LTI/LTV fork that organized §§11.2–11.3, as §11.5 will make explicit. Hyena Poli et al. (2023) interleaves implicit long convolutions with elementwise multiplicative gating. The convolution is the object to teach: a causal long convolution is a linear time-invariant operator — the convolutional view of an SSM (Proposition 8.1) — computable in $O(\seqlen \log \seqlen)$ by the FFT rather than the $O(\seqlen^2)$ explicit Toeplitz product. For input $u \in \mathbb{R}^{B \times \seqlen \times D}$ and a per-channel filter $k$ ,

y_{b,t,d} = \sum_{s=0}^{t} k_{d,\,t-s}\, u_{b,s,d} + \mathrm{bias}_d\, u_{b,t,d}.

The one subtlety is causality under the cyclic FFT.

Theorem 11.3 (Zero-padding to 2L computes the causal convolution).

Padding $u$ and the filter $k$ to length $2\seqlen$ , multiplying their real FFTs pointwise, inverse-transforming, and truncating to the first $\seqlen$ samples yields exactly the lower-triangular Toeplitz product above. Padding to $2\seqlen$ (in fact any $N \ge 2\seqlen - 1$ ) is sufficient: it sends the cyclic FFT’s wrap-around terms onto the zero-padded tail of $k$ , where they contribute nothing. The un-padded length- $\seqlen$ transform wraps late taps onto early outputs and breaks causality, so some padding past $\seqlen$ is required.

The proof is Exercise 11.6(a) (§11.9). The companion makes both halves concrete: the $O(\seqlen \log \seqlen)$ FFT route reproduces the $O(\seqlen^2)$ explicit-Toeplitz oracle to $2.7 \times 10^{-15}$ in float64 (the predecessor implementation ran float32 at a $10^{-4}$ tolerance — this is a $10^{11}$ tightening), while the un-padded variant disagrees with the oracle by $7.1$ , an $O(1)$ violation of causality.

Two panels: left, the maximum difference between FFT convolution and the explicit Toeplitz reference versus sequence length, flat near machine epsilon; right, a log-log plot of FFT O(L log L) versus Toeplitz O(L-squared) cost. — Hyena's primitive. Left: the $2\seqlen$-padded FFT convolution matches the explicit causal-Toeplitz oracle to $\sim 10^{-15}$ in float64 across sequence length (Theorem, §11.4). Right: the cost separation $O(\seqlen \log \seqlen)$ versus $O(\seqlen^2)$ that motivates the FFT for long filters. Produced by companions/ch11/jax/fftconv.py; the identity and the $2\seqlen$-padding necessity are pinned in tests/test_fftconv.py.

11.5 Implicit and selective filters

What does the long convolution mean in the dynamical-systems vocabulary of this book? Everything hinges on whether the filter is fixed or input-dependent.

When the filter $k$ is static across the batch and sequence, $y = k * u$ is a single linear time-invariant convolution — one kernel, the convolutional face of an SSM (Proposition 8.1). S4 lives exactly here: its kernel $K = (\outputmat \discB, \outputmat \discA \discB, \outputmat \discA^2 \discB, \dots)$ is materialized from a structured continuous-time ODE, and the same FFT machinery of §11.4 applies. Hyena keeps the unrestricted kernel and pays the $\log$ factor, where S4 and Mamba fix the kernel shape — a structured semiseparable matrix (Theorem 9.4) — to buy $O(\seqlen)$ .

When the filter is made input-dependent — the “selective Hyena” variant, where the convolution kernel itself is a function of the input — time-invariance breaks, exactly as Chapter 9’s input-dependent $(\stepsize_t, \inputmat_t, \outputmat_t)$ broke it. No single kernel exists; no convolution theorem applies; the operator rejoins the LTV family of §11.3. This is the same fork seen three times now: a fixed transition is LTI and admits a convolution or a closed-form mask; an input-dependent transition is LTV and admits only the scan. Hyena’s gating, GLA’s decay, and Mamba’s selection are three spellings of one move.

11.6 Counter-evidence: associative recall and capacity

Beyond-SSM does not mean better-than-SSM. The honest place to end is where linear attention loses: multi-query associative recall (MQAR) — store $K$ key–value pairs, then answer queries by retrieving the value bound to a key Arora et al. (2024) . On this task, linear attention degrades as $K$ grows while selective SSMs (Mamba) and softmax attention stay near-perfect. The mechanism is the §11.2 insight turned into a limitation, and it is provable.

Proposition 11.4 (Finite state bounds associative-recall capacity).

The linear-attention state $S = \sum_{i=1}^{K} \phi(k_i)\, v_i^\top$ is a sum of $K$ rank-one matrices, so $\operatorname{rank} S \le \min(K, d_k, d_v)$ . If $K > d_k$ , the feature vectors $\{\phi(k_i)\} \subset \mathbb{R}^{d_k}$ are linearly dependent, so the readout $S^\top \phi(k_j)$ cannot reproduce $K$ linearly independent values: there exist queries for which exact recall is impossible, regardless of $\phi$ . Capacity is bounded by the feature dimension.

The proof is Exercise 11.6(b) (§11.9). The companion turns the bound into a measured fact (a fixed-weight mechanism, not a trained model — the trained-model accuracy curves are cited to Zoology Arora et al. (2024) ). Storing $K$ random unit keys with orthonormal values and reading each back, the per-binding retrieval error $\frac{1}{K}\sum_j \lVert \hat v_j - v_j \rVert$ behaves exactly as the rank bound predicts:

Below capacity, exact. With $K \le d_k$ orthonormal keys the features are independent, the state is full-rank, and recall is exact — error $\sim 5 \times 10^{-16}$ .
Past capacity, degrading. With generic keys the error grows like $\sqrt{K/d_k}$ : at $K = 64$ it is $1.99$ for $d_k = 16$ and $0.98$ for $d_k = 64$ (capacity scales with the state size), rising to $2.81$ at $K = 128$ .
The oracle stays exact. Softmax attention’s exponential sharpening concentrates the readout on the self-match, so its error is $0.000$ independent of $K$ — the capacity-unbounded baseline that linear attention cannot match.

Retrieval error versus number of key-value pairs K. Two linear-attention curves, for state size 16 and 64, rise as K grows, with the larger state rising more slowly; the softmax oracle curve stays flat at zero. — The capacity wall (fixed-weight mechanism, not a trained model). Linear-attention retrieval error grows like $\sqrt{K/d_k}$ as the number of stored pairs exceeds the state size; enlarging the feature dimension lowers it (here $d_k$: $16 \to 64$ cuts the error in half), but no finite state escapes the trend. Softmax attention stays at zero error for all $K$. The structural cause is the §11.6 rank bound: a rank-$d_k$ state cannot hold unboundedly many independent associations. Produced by companions/ch11/jax/mqar_recall.py; pinned in tests/test_mqar_recall.py (exact sub-capacity recall, growth with $K$, and capacity scaling with $d_k$).

Gating (§11.3) helps at the margin — a decay lets the state shed stale bindings — but it reweights a finite state, it does not raise the rank ceiling. Closing the gap needs either a genuinely larger structured state (the selective-SSM route of Chapters 9–10) or a smarter write rule, which is the subject of Chapter 12.

11.7 What’s next

This chapter built the linear-attention half of the “beyond-SSM” family: an additive matrix-state recurrence ( $A = I$ ), gated into the LTV/selectivity face (GLA, RetNet), with Hyena’s implicit LTI long convolution as the sibling operator and a hard capacity limit (§11.6) at the bottom.

Chapter 9 promised this chapter would develop “linear attention and the delta-rule lineage” as the attention-side reading of structured contractions. We have built the linear-attention half; the delta-rule half sharpens exactly the §11.6 capacity limit. Instead of blindly accumulating $\phi(k)v^\top$ , the delta rule overwrites the value currently associated with a key — a rank-one correction

S_t = \big(I - \beta_t\, \phi(k_t)\phi(k_t)^\top\big) S_{t-1} + \beta_t\, \phi(k_t) v_t^\top,

turning the identity-transition recurrence into a genuinely data-dependent write. That is Chapter 12 (the online-learning reading: DeltaNet as explicit-Euler, Longhorn as backward-Euler). Chapter 13 adds exponential gates and matrix memory (xLSTM, RWKV-7); Chapter 14 mixes these layers with attention into the production hybrids. The structure is one family; we are still mapping its faces.

11.8 Exercises

Three short problems (solutions inline) and three longer ones (solutions in §11.9).

Exercise 11.1 (short)

Take $d_k = 2$ , $\phi = \mathrm{elu}{+}1$ , and two key–value pairs $(k_1, v_1), (k_2, v_2)$ with scalar values. Write $S_2$ explicitly as a sum of two outer products, and verify by hand that the unnormalized readout $S_2^\top \phi(q)$ for a query $q$ equals row $t{=}2$ of the masked-attention form $(L \circ (Q_\phi K_\phi^\top))V$ .

Solution

$S_2 = \phi(k_1)v_1^\top + \phi(k_2)v_2^\top$ (a $2 \times 1$ matrix since $v$ is scalar). Then $S_2^\top \phi(q) = v_1\,\phi(k_1)^\top\phi(q) + v_2\,\phi(k_2)^\top \phi(q) = \sum_{j \le 2} (\phi(q)^\top \phi(k_j))\, v_j$ . Row $t{=}2$ of the masked form is $\sum_j L_{2j}(\phi(q_2)^\top\phi(k_j))v_j$ with $L_{2j} = 1$ for $j \in \{1, 2\}$ — identical (taking $q = q_2$ ). The outer-product sum and the masked row are two evaluations of the same contraction.

Exercise 11.2 (short, code)

Run companions/ch11/jax/linear_attention.py. Confirm the recurrent and parallel forms agree to $< 10^{-12}$ , and explain in one sentence why the mask $L$ here is all ones below the diagonal, whereas Chapter 9’s mask carried a decay.

Solution

The script prints $\max|y_{\mathrm{rec}} - y_{\mathrm{par}}| \approx 5.6 \times 10^{-16}$ . The mask is all-ones because ungated linear attention has transition $A = I$ : nothing decays between positions, so the weight on key $j$ at query $t$ is $1$ for every $j \le t$ . Chapter 9’s mask carried $\exp(\statemat \sum \stepsize)$ because its transition $\discA_t = e^{\statemat \stepsize_t} < 1$ decays the contribution of older positions. Gating (§11.3) restores that decay.

Exercise 11.3 (short)

Show that a constant gate $\gamma_t \equiv \gamma$ reduces the GLA decay mask to RetNet’s $L_{tj} = \gamma^{\,t-j}$ , and explain in one line why a constant gate makes the mask Toeplitz while an input-dependent $\gamma_t$ does not.

Solution

$L_{tj} = \prod_{i=j+1}^{t} \gamma_i = \prod_{i=j+1}^{t} \gamma = \gamma^{\,t-j}$ . This depends only on the lag $t - j$ , so $L$ is constant along each diagonal — Toeplitz. With $\gamma_t$ input-dependent the product $\prod_{i=j+1}^t \gamma_i$ depends on which steps are spanned, not just how many, so equal-lag entries on a diagonal differ — the mask is no longer Toeplitz. (The companion’s RetNet mask has Toeplitz residual exactly $0$ .)

Exercise 11.4 (theory) — solution in §11.9

Prove the recurrent–parallel equivalence of §11.2: the recurrent matrix-state form and the masked-parallel form compute the same output, and state the time/memory cost of each.

Exercise 11.5 (theory) — solution in §11.9

Prove the gated-linear-attention duality of §11.3, and identify the exact correspondence $\log \gamma_i = \statemat_i \stepsize_i$ under which the scalar-gate GLA mask coincides with the Chapter 9 decay mask of Theorem 9.5.

Exercise 11.6 (theory) — solution in §11.9

(a) Prove the FFT-causality theorem of §11.4 — that padding past $\seqlen$ (to $2\seqlen$ ) lets the FFT compute the causal convolution, while the un-padded transform does not. (b) Prove the rank bound of the §11.6 capacity proposition, and connect it to the $\sqrt{K/d_k}$ growth of retrieval error in the §11.6 capacity figure.

11.9 Full solutions to theory exercises

Solution to Exercise 11.4

Unrolling the recurrence from $S_0 = 0$ gives $S_t = \sum_{j \le t} \phi(k_j) v_j^\top$ . Hence

S_t^\top \phi(q_t) = \sum_{j \le t} v_j\, \phi(k_j)^\top \phi(q_t) = \sum_{j \le t} \big(\phi(q_t)^\top \phi(k_j)\big)\, v_j .

Row $t$ of the parallel form is $[(L \circ (Q_\phi K_\phi^\top))V]_t = \sum_j L_{tj}\,(\phi(q_t)^\top \phi(k_j))\, v_j = \sum_{j \le t}(\phi(q_t)^\top\phi(k_j)) v_j$ , since $L_{tj} = \mathbb{1}[j \le t]$ . The two are identical. The recurrent form stores only $S_t \in \mathbb{R}^{d_k \times d_v}$ and does $O(d_k d_v)$ work per step: $O(\seqlen d_k d_v)$ time, $O(d_k d_v)$ memory. The parallel form materializes the $\seqlen \times \seqlen$ score matrix: $O(\seqlen^2 d)$ time. $\blacksquare$

Solution to Exercise 11.5

Unrolling the gated recurrence,

S_t = \sum_{j \le t} \Big(\textstyle\prod_{i=j+1}^{t} \mathrm{diag}(\gamma_i)\Big) \phi(k_j) v_j^\top = \sum_{j \le t} \mathrm{diag}\!\Big(\textstyle\prod_{i=j+1}^{t}\gamma_i\Big)\phi(k_j)v_j^\top .

Contracting with $\phi(q_t)$ , $\;y_t = \sum_{j\le t} v_j \sum_c \phi(q_t)_c \big(\prod_{i=j+1}^t \gamma_{i,c}\big)\phi(k_j)_c$ . Writing $g_{t,c} = \sum_{i\le t} \log\gamma_{i,c}$ gives $\prod_{i=j+1}^t \gamma_{i,c} = e^{g_{t,c}-g_{j,c}}$ , so with $\tilde q_t = \phi(q_t)\odot e^{g_t}$ and $\tilde k_j = \phi(k_j)\odot e^{-g_j}$ we get $y_t = \sum_{j\le t}(\tilde q_t^\top \tilde k_j) v_j = [(L_\gamma \circ (Q_\phi K_\phi^\top))V]_t$ with $L_{\gamma,tj} = \prod_{i=j+1}^t \gamma_i$ . For a scalar gate $\gamma_{i,c} = \gamma_i$ , the mask is $L_{\gamma,tj} = \prod_{i=j+1}^t \gamma_i$ ; under $\log\gamma_i = \statemat_i \stepsize_i$ this is $\exp(\sum_{i=j+1}^t \statemat_i \stepsize_i)$ , exactly the Theorem 9.5 mask $L_{tj} = \exp(\statemat \sum_{i=j+1}^t \stepsize_i)$ . The off-diagonal blocks of $L_\gamma$ have rank one (a product of a column $e^{g_t}$ and a row $e^{-g_j}$ ), so $L_\gamma$ is 1-semiseparable. The companion confirms the masks are bit-identical ( $0.0$ difference). $\blacksquare$

Solution to Exercise 11.6

(a) Zero-pad $u, k$ to length $N = 2\seqlen$ . The pointwise product of their DFTs is the cyclic convolution $(u \circledast_N k)[n] = \sum_m u[m]\,k[(n-m)\bmod N]$ . For $0 \le n < \seqlen$ and $0 \le m < \seqlen$ (both zero beyond $\seqlen - 1$ ), the index $n - m$ ranges over $-(\seqlen{-}1) \dots (\seqlen{-}1)$ . A negative index wraps to $N + (n-m) \in [\seqlen{+}1,\,2\seqlen{-}1]$ , where the padded $k$ is zero — so wrapped terms contribute nothing, leaving $(u\circledast_N k)[n] = \sum_{m\le n} u[m]k[n-m]$ , the causal linear convolution. With $N = \seqlen$ (no padding) the same wrap sends $m > n$ to $\seqlen + (n-m) \in [1, \seqlen{-}1]$ , where $k$ is nonzero, injecting acausal terms. Hence any $N \ge 2\seqlen - 1$ suffices (the companion uses $2\seqlen$ ), while $N = \seqlen$ fails. The companion’s un-padded variant differs from the oracle by $7.1$ . $\blacksquare$

(b) $S = \sum_{i=1}^K \phi(k_i) v_i^\top$ is a sum of $K$ rank-one matrices, so $\operatorname{rank} S \le K$ ; it is also $\le d_k$ (its columns lie in $\mathbb{R}^{d_k}$ ) and $\le d_v$ . Thus $\operatorname{rank} S \le \min(K, d_k, d_v)$ . If $K > d_k$ , the $K$ vectors $\phi(k_i) \in \mathbb{R}^{d_k}$ are linearly dependent: some $\phi(k_m) = \sum_{i\ne m} c_i \phi(k_i)$ , so the readout $\hat v_m = S^\top\phi(k_m)$ is a fixed linear combination of the readouts for the other keys and cannot independently equal $v_m$ for arbitrary value assignments. Exact recall of $K$ independent bindings is therefore impossible. The $\sqrt{K/d_k}$ growth in the §11.6 capacity figure is the quantitative shadow of this: with random unit keys the interference $\sum_{i\ne j}\langle k_j, k_i \rangle v_i$ has $\sim K$ terms each of size $\sim 1/\sqrt{d_k}$ , so its norm scales as $\sqrt{K/d_k}$ — zero only when the keys are orthonormal, which requires $K \le d_k$ . $\blacksquare$

11.10 Companion code

The companions live in companions/ch11/{jax,torch,julia} and are float64 throughout (§11.2 explains why). The JAX modules are canonical and produce every figure; the torch modules mirror them and are checked for cross-framework parity to $< 10^{-9}$ ; the Julia module ports Hyena’s convolution to a third language using only the standard library.

# JAX (canonical; emits the four figures to public/figures/ch11/)
PYTHONPATH=. python companions/ch11/jax/linear_attention.py
PYTHONPATH=. python companions/ch11/jax/gated_linear_attention.py
PYTHONPATH=. python companions/ch11/jax/fftconv.py
PYTHONPATH=. python companions/ch11/jax/mqar_recall.py

# Tests: JAX identities (< 1e-12), torch parity (< 1e-9), Julia FFT identity
.venv/bin/pytest companions/ch11/jax companions/ch11/torch -q
julia --project=companions/ch11/julia companions/ch11/julia/runtests.jl

linear_attention.py — the matrix-state recurrence and its masked-parallel twin (§11.2), the capacity-rank object (§11.6), and the float64-vs-float32 normalizer demonstration.
gated_linear_attention.py — GLA/RetNet gating, the decay masks, and the cross-chapter bridge to Theorem 9.5 built on Chapter 9’s own segsum (§11.3).
fftconv.py — Hyena’s $2\seqlen$ -padded FFT convolution against the explicit Toeplitz oracle, plus the un-padded variant that fails causality (§11.4). The torch port documents the buffers-vs-Parameters distinction (a fixed mask is a buffer; a learned filter is a Parameter).
mqar_recall.py — the fixed-weight associative-recall mechanism behind §11.6: exact recall below capacity, $\sqrt{K/d_k}$ degradation above it, and the softmax oracle at zero error.

Part III · Beyond SSMs Week 12 Published

Delta-rule lineage: DeltaNet, Gated DeltaNet, Kimi Linear

The delta rule read as one gradient step on an associative-recall loss — DeltaNet (explicit Euler) and Longhorn (backward Euler) as the textbook explicit/implicit discretization pair, the stability dichotomy that follows, chunkwise parallelization via the WY trick, and gated forgetting from Gated DeltaNet to Kimi Linear.

Delta-rule lineage: DeltaNet, Gated DeltaNet, Kimi Linear

Chapter 12 — at a glance

Goal: read the state update of a sequence layer as online learning — each step is one update of an optimizer minimizing an associative-recall loss. Under that lens DeltaNet is one explicit (forward-Euler) gradient step, Longhorn is the implicit (backward-Euler) step on the same gradient flow, and the entire explicit/implicit stability theory of Chapters 5–6 transfers verbatim: DeltaNet has a finite stability interval, Longhorn cannot be destabilized. Chunkwise parallelization (the WY trick) and gated forgetting (Gated DeltaNet → Kimi Linear) complete the lineage.

Reading time: ~45 minutes prose; ~2 hours with the companions.

Key insight: a state update is an optimizer, and the architecture choice is a discretization choice. The delta rule $S_t = S_{t-1} + \beta_t (v_t - S_{t-1}k_t)k_t^\top$ is literally one gradient step on $\tfrac{1}{2}\|S k_t - v_t\|^2$ — so the question “is this recurrence stable?” is the question “is this step size inside the method’s stability region?”, and the answer is the same dichotomy the book met for ODE integrators: explicit steps buy cheapness and risk blow-up; implicit steps buy unconditional stability.

Looking ahead: this chapter is B-pilot load-bearing. The explicit-vs-implicit analysis of the online-learning step is the mathematical spine of pilot B’s gating analysis, and §12.6’s two-limit reading of the attention↔SSM family seeds the singular-perturbation framing the benchmark chapters (14, 16) will use.

12.1 The recall objective and its gradient flow

Chapter 11 ended at a wall (Proposition 11.4): the additive state $S = \sum_i \phi(k_i)v_i^\top$ accumulates and never forgets, so every stored pair interferes with every retrieval through the key overlaps $k_i^\top k_j$ , and no finite state escapes the trend. The closing promise was a smarter write rule. To derive one, stop postulating recurrences and instead ask what the state is for.

The state of every architecture in this part of the book is an associative memory: it should return $v$ when probed with $k$ . Make that a loss. At time $t$ , with the incoming pair $(k_t, v_t)$ ,

\mathcal{L}_t(S) \;=\; \tfrac{1}{2}\,\bigl\|S k_t - v_t\bigr\|^2 .

Two conventions before anything else. First, shapes: this chapter writes the state as $S \in \mathbb{R}^{d_v \times d_k}$ with retrieval $S k$ — the row-state form used by the DeltaNet literature and by Chapter 3’s low-rank update discussion. Chapter 11 wrote the transposed form ( $S \in \mathbb{R}^{d_k \times d_v}$ , projector acting on the left); the map $S \mapsto S^\top$ converts one to the other, and every claim below transposes with it. Second, the feature map: the delta-rule literature works with the keys directly (typically $\ell_2$ -normalized) rather than through a kernel feature map $\phi$ ; we follow that and write $k_t$ for what Chapter 11 would call $\phi(k_t)$ . Nothing in this chapter depends on the distinction — and the normalization choice will turn out to be a stability decision (§12.4).

Gradient descent on $\mathcal{L}_t$ in continuous time is the recall gradient flow

\dot S \;=\; -\nabla_S \mathcal{L}_t(S) \;=\; (v_t - S k_t)\,k_t^\top ,

a linear (affine) matrix ODE. Its equilibrium is the rank-one matrix that retrieves perfectly:

Proposition 12.1 (The recall flow and its fixed point).

For a fixed pair $(k, v)$ with $k \ne 0$ , the gradient flow $\dot S = (v - Sk)k^\top$ has fixed-point set $\{S : Sk = v\}$ , and the minimum-norm equilibrium reachable from $S_0 = 0$ is

S^\star \;=\; \frac{v\,k^\top}{\|k\|^2}, \qquad S^\star k = v .

The flow contracts the deviation $E = S - S^\star$ in the $k$ direction at rate $\|k\|^2$ and leaves the orthogonal complement untouched: directions the memory has never been asked about are neither corrected nor forgotten.

The proof is a two-line computation with the substitution $E = S - S^\star$ (Exercise 12.5 carries it through the discretizations). The inert orthogonal complement is a feature, not a defect — it is persistent memory, and §12.6 will need a gate precisely because gradient flow on the current pair never cleans up stale directions on its own.

Everything in this chapter is now one sentence: DeltaNet, Longhorn, and their gated descendants are different one-step discretizations of this flow, interleaved over a stream of pairs $(k_t, v_t)$ . The architecture zoo becomes an integrator catalog, and the book’s Part-I machinery applies.

12.2 DeltaNet: one explicit gradient step

Take one forward-Euler step of size $\beta_t$ on the recall flow (Chapter 4’s explicit workhorse), at the incoming pair:

Proposition 12.2 (The delta rule, two algebraic faces).

One explicit gradient step $S_t = S_{t-1} + \beta_t\,(v_t - S_{t-1}k_t)\,k_t^\top$ is identical to the erase-then-write form

S_t \;=\; S_{t-1}\,\bigl(I - \beta_t\,k_t k_t^\top\bigr) \;+\; \beta_t\,v_t k_t^\top .

The first form costs one mat-vec plus one rank-one update, $O(d_v d_k)$ per step, and never materializes the $d_k \times d_k$ projector; the second exposes the mechanism — selective erasure along $k_t$ , then a fresh rank-one write.

The identity is one distribution of terms: $S_{t-1}(I - \beta k k^\top) + \beta v k^\top = S_{t-1} + \beta(v - S_{t-1}k)k^\top$ . Chapter 3 previewed exactly this update as the recurring “low-rank correction” pattern — for $\beta_t \|k_t\|^2 < 1$ the factor $(I - \beta_t k_t k_t^\top)$ is a contraction toward the hyperplane $k_t^\perp$ , and at $\beta_t \|k_t\|^2 = 1$ it is the orthogonal projector onto it. With $\beta_t \in (0,1]$ a learned, input-dependent write strength and the read $o_t = S_t q_t$ , this recurrence is DeltaNet’s state update Yang et al. (2024) , the delta rule of fast-weight programming Schlag et al. (2021) made into a sequence layer. The companion’s lax.scan (rank-one form) and a deliberately different materialized-projector loop agree to $1.8 \times 10^{-15}$ over 48 steps — the two faces are one operator.

What did the erase term buy over Chapter 11’s accumulation? The defining semantics: re-storing a key replaces its value. Write $(k, v_1)$ , then $(k, v_2)$ , with $\beta = 1$ on a unit key. The erase factor annihilates the old binding before the new write lands, so the delta-rule state retrieves $S k = v_2$ exactly — measured residual $4.4 \times 10^{-16}$ — while the additive state retrieves $v_1 + v_2$ , off by the entire stale residue (measured $\ell_\infty$ error $1.27$ on the same data). An additive memory can only pile bindings on top of each other; a delta-rule memory updates them.

Two panels. Left: mean retrieval error versus number of stored pairs K for random unit keys; the additive curve rises above the delta-rule curve, both increasing past the state dimension marked at K equals 32. Right: the same comparison with orthonormal keys; both curves sit at machine precision below the 1e-12 pin line. — What the erase term buys. Left: with random unit keys ($d_k = 32$), sequential delta-rule writes ($\beta = 1$) retrieve with consistently lower interference error than the additive state of Chapter 11 — at $K = 32$ stored pairs, mean $\ell_\infty$ error $1.44$ versus $1.997$ — but both still grow with $K$: erasure reduces interference, it does not raise the rank ceiling of Chapter 11's capacity bound (Proposition 11.4). Right: with orthonormal keys both rules are exact to $\sim 10^{-15}$ — there is no interference to erase, which isolates interference (not capacity) as the thing the delta rule fixes. Produced by companions/ch12/jax/delta_rule.py; pinned in tests/test_delta_rule.py (overwrite exactness, orthonormal exactness, delta-below-additive).

The figure is deliberately honest about the limit. With orthonormal keys both rules are exact — the delta rule’s advantage is zero when keys do not overlap — and with random keys both error curves still grow as $K$ passes $d_k$ : a rank- $d_k$ state cannot hold more than $d_k$ independent associations no matter how cleverly it writes (Proposition 11.4 still binds). The delta rule answers Chapter 11’s interference problem and its staleness problem (the overwrite), not the capacity bound. That distinction is exactly where Chapter 13’s matrix-memory architectures and Chapter 14’s hybrids will pick up.

12.3 Longhorn: the implicit step

Forward Euler is not the only one-step method, and Chapter 6 taught the alternative. Longhorn Liu et al. (2024) reaches it from the optimization side: frame the update as amortized online learning and solve, at each step, the regularized problem

S_t \;=\; \arg\min_S\; \tfrac{1}{2}\,\|S k_t - v_t\|^2 \;+\; \tfrac{\alpha_t}{2}\,\|S - S_{t-1}\|_F^2 ,

a trust-region (proximal) step: improve recall on the incoming pair, but do not move far from the state you have. Setting the gradient to zero gives

\alpha_t\,(S_t - S_{t-1}) \;+\; (S_t k_t - v_t)\,k_t^\top \;=\; 0 \quad\Longleftrightarrow\quad S_t \;=\; S_{t-1} + \tfrac{1}{\alpha_t}\,(v_t - S_t k_t)\,k_t^\top .

Read the right-hand form against §12.1: this is the recall gradient evaluated at the endpoint $S_t$ , with step size $h = 1/\alpha_t$ — the backward-Euler step on the recall flow, the implicit method of Chapter 6.

An implicit equation usually costs a solve. Here the solve is free:

Theorem 12.3 (Longhorn's implicit step in closed form).

For $\alpha_t > 0$ the backward-Euler equation above has the unique solution

S_t \;=\; S_{t-1} \;+\; \underbrace{\frac{1}{\alpha_t + \|k_t\|^2}}_{\textstyle\beta_t^{\mathrm{eff}}} \,(v_t - S_{t-1}k_t)\,k_t^\top ,

i.e. the implicit step is the delta rule evaluated at the self-limiting rate $\beta_t^{\mathrm{eff}} = 1/(\alpha_t + \|k_t\|^2)$ . In particular $\beta_t^{\mathrm{eff}} \le 1/\alpha_t$ and $\beta_t^{\mathrm{eff}}\|k_t\|^2 = \|k_t\|^2/(\alpha_t + \|k_t\|^2) < 1$ for every key, however large.

The proof (Exercise 12.4, §12.9) is the classic implicit-solve move: right- multiply the stationarity equation by $k_t$ , solve the resulting scalar equation for the post-update prediction $S_t k_t$ , and substitute back. In the companion the identification is structural — longhorn_step simply is delta_rule_step evaluated at $\beta^{\mathrm{eff}}$ — so the closed form gets an independent certificate instead: solving the stationarity system $S_t(\alpha I + k k^\top) = \alpha S_{t-1} + v k^\top$ by dense linear solve (no code shared with the rank-one form) reproduces it to $4.4 \times 10^{-16}$ , and the closed form zeroes the stationarity residual to $7.8 \times 10^{-16}$ (pinned $< 10^{-12}$ in tests/test_longhorn.py::test_closed_form_equals_dense_implicit_solve). Explicit and implicit methods on this flow live in the same parametric family — they differ only in how the effective step size responds to the data. DeltaNet’s $\beta_t$ is whatever the network learned to emit; Longhorn’s $\beta_t^{\mathrm{eff}}$ self-adapts to the key magnitude. Scaling the key by $10^2$ pushes the product $\beta^{\mathrm{eff}}\|k\|^2$ to $1 - 1.7\times 10^{-5}$ ; by $10^4$ , to $1 - 1.7\times 10^{-9}$ — approaching but provably never reaching $1$ . That denominator is the entire difference between the two architectures, and §12.4 shows it is exactly the difference between conditional and unconditional stability.

12.4 Stability regions: the explicit/implicit dichotomy

Both updates iterate an affine map in $S$ , so no linearization is needed: with $S^\star = v k^\top / \|k\|^2$ and $E_t = S_t - S^\star$ , a step at the (for now, repeated) pair $(k, v)$ gives exactly

E_t \;=\; E_{t-1}\,\bigl(I - \beta^{\mathrm{eff}}\,k k^\top\bigr) .

The iteration matrix is a rank-one perturbation of the identity; its spectrum is immediate.

Theorem 12.4 (The stability dichotomy).

The matrix $I - \beta^{\mathrm{eff}} kk^\top$ has eigenvalue $1 - \beta^{\mathrm{eff}}\|k\|^2$ in the $k$ direction and $1$ on the orthogonal complement. The deviation therefore contracts in the only direction the update touches iff $\rho_k = |1 - \beta^{\mathrm{eff}}\|k\|^2| < 1$ , and:

DeltaNet (explicit, $\beta^{\mathrm{eff}} = \beta$ ): stable iff $\beta\|k\|^2 \in (0, 2)$ . Past the boundary $\beta\|k\|^2 = 2$ the deviation alternates sign and grows geometrically — forward Euler leaving its stability interval, Chapter 5’s picture replayed.
Longhorn (implicit, $\beta^{\mathrm{eff}} = 1/(\alpha + \|k\|^2)$ ): $\rho_k = \alpha/(\alpha + \|k\|^2) < 1$ for every $\alpha > 0$ and every key. Unconditional stability — the backward-Euler guarantee of Proposition 6.1, transferred to the online-learning flow. Moreover $\rho_k \to 0$ as $\|k\|^2/\alpha \to \infty$ : the hardest writes are the most strongly contracted, the L-stability signature.

Two panels. Left: DeltaNet spectral radius as a V-shaped curve in beta times key norm squared, dipping to zero at 1 and crossing above one at the boundary 2, with the unstable region shaded. Right: Longhorn spectral radius versus key magnitude over alpha on a log axis, decreasing from one toward zero and never crossing one. — The stability dichotomy, plotted from the closed forms of Theorem 12.4. Left: DeltaNet's $\rho_k = |1 - \beta\|k\|^2|$ — stable only on $(0, 2)$; at $\beta\|k\|^2 = 2.5$ the radius is $1.5$ and the iteration diverges. Right: Longhorn's $\rho_k = \alpha/(\alpha + \|k\|^2)$ stays below $1$ for every key magnitude and decays to $0$ ($\rho_k = 0.0099$ at $\|k\|^2/\alpha = 100$): unconditional stability with L-stable strong contraction of large writes. Produced by companions/ch12/jax/stability.py; the analytic curves match the Rayleigh quotient of the materialized iteration matrix to a measured $6.7 \times 10^{-16}$ max drift, pinned $< 10^{-12}$ in tests/test_stability.py.

Because the deviation recurrence is exact — not a linearization — the geometric law is measurable to machine precision. Starting from $S_0 = 0$ under a repeated unit-norm key, the per-step ratio $\|E_t\|_F / \|E_{t-1}\|_F$ measured from the companion equals the analytic radius to twelve digits: $1.500000000000$ for DeltaNet at $\beta\|k\|^2 = 2.5$ , $0.500000000000$ for Longhorn at $\alpha = \|k\|^2$ . The trajectories make the dichotomy visceral:

Semilog plot of normalized deviation from the fixed point versus step count. Three DeltaNet curves: geometric decay at rate 0.5, slow decay at 0.9, and geometric growth at 1.5 crossing above the reference line at one. One dashed Longhorn curve decays at rate 0.5. — Exact geometric trajectories under a repeated pair, one curve per effective step size; each legend entry quotes the analytic radius $\rho$, and the measured early-step ratios reproduce it to twelve digits (pinned at $10^{-12}$ above the float noise floor in test_longhorn.py::test_error_decay_ratio_equals_analytic_rho). DeltaNet at $\beta\|k\|^2 = 2.5$ diverges exactly as forward Euler past its boundary; Longhorn at $\alpha = \|k\|^2$ contracts at $\rho = 0.5$ and cannot be pushed past $1$ by any key. Produced by companions/ch12/jax/longhorn.py.

One degenerate-looking feature of Theorem 12.4 deserves emphasis: the orthogonal eigenvalue is exactly $1$ for both methods. Neither integrator forgets directions it has not been asked about — stability here is about the write direction only. Whole-state forgetting needs a different mechanism (§12.6).

Why trained DeltaNets are stable anyway. The dichotomy explains a design detail that looks cosmetic in the paper and is load-bearing in the lens: Yang et al. $\ell_2$ -normalize keys and parameterize $\beta_t = \mathrm{sigmoid}(\cdot) \in (0,1)$ Yang et al. (2024) . With $\|k_t\| = 1$ that pins $\beta_t\|k_t\|^2 < 1$ — not merely inside the stability interval $(0,2)$ but inside its monotonically-contracting half. And the per-step guarantee composes, which discharges the repeated-pair scoping above: with $\beta_t\|k_t\|^2 < 1$ every factor $I - \beta_t k_t k_t^\top$ has eigenvalues in $(0, 1]$ , so the streaming product over any key sequence is non-expansive — the repeated pair is the worst case, not a special case. The architecture’s normalization choices are its stability guarantee, chosen at design time the way a step-size rule is chosen for an integrator. Longhorn needs no such guard, and that is precisely its selling point: stability by method, not by parameterization — the implicit-method trade Chapter 6 priced out, here with a free solve.

12.5 Chunkwise parallelization: the WY trick

The recurrent delta rule is $O(\seqlen)$ sequential — fine for inference, hostile to training hardware. Chapter 11 met the same tension and resolved it with a second face of the same operator (Theorem 11.1); the delta rule’s resolution is the same move with one extra piece of numerical linear algebra. Split the sequence into chunks of size $C$ and unroll the erase-then-write form across one chunk:

S_{\mathrm{end}} \;=\; S_{\mathrm{entry}}\,P \;+\; R, \qquad P \;=\; \prod_{t \in \mathrm{chunk}} \bigl(I - \beta_t k_t k_t^\top\bigr),

with $R$ the chunk’s accumulated writes — the chunk’s own §12.2 recurrence run from a zero entry state, which is exactly how the companion materializes it. The cross-chunk recurrence is now one affine update per chunk ( $\seqlen/C$ sequential steps) and everything inside a chunk touches only chunk-local data Yang et al. (2024) . The remaining cost is the erase product $P$ — naively a chain of $d_k \times d_k$ matrix products. But $P$ is a product of rank-one perturbations of $I$ , and numerical linear algebra has compressed exactly this object for decades:

Theorem 12.5 (WY representation of the erase product).

For keys $k_1, \dots, k_C$ and rates $\beta_1, \dots, \beta_C$ , the chunk erase product admits the compact form

\prod_{t=1}^{C}\bigl(I - \beta_t k_t k_t^\top\bigr) \;=\; I - W^\top Y,

where the rows of $Y \in \mathbb{R}^{C \times d_k}$ are the keys and the rows of $W$ obey the recursion $w_t = \beta_t\bigl(k_t - \sum_{j<t} (k_j^\top k_t)\, w_j\bigr)$ . Applying $P$ to a state is then two GEMMs, $SP = S - (SW^\top)Y$ , never a materialized $d_k \times d_k$ product chain.

The induction (Exercise 12.6, §12.9) is three lines. The companion pins both identities at machine precision on a stable-regime stream (unit-norm keys, $\beta_t < 1$ — §12.4’s recommended operating point): the WY form matches the explicitly multiplied product to $3.7 \times 10^{-16}$ at worst across chunk sizes $C = 2$ through $64$ , and the chunkwise driver — cross-chunk state passing with chunk-local sweeps — reproduces the monolithic recurrence bitwise (measured difference $0.0$ for every chunk size $C \mid \seqlen$ , because both run the same scan arithmetic in the same order). As in Chapter 11, the equivalence is the correctness certificate for the production form: one operator, two computation schedules.

Two semilog panels versus chunk size on a log-2 axis. Left: maximum difference between chunkwise and recurrent outputs, flat at the display floor far below the 1e-12 pin line. Right: maximum difference between the WY form and the explicitly multiplied erase product, flat near machine epsilon, far below the pin line. — Both §12.5 equivalences across chunk sizes at $\seqlen = 64$ (float64, stable-regime stream: unit-norm keys, $\beta_t < 1$). Left: the chunkwise driver equals the monolithic recurrence exactly — measured difference $0.0$, plotted at the $10^{-18}$ display floor — for every $C$ from $1$ to $\seqlen$. Right: $I - W^\top Y$ equals the materialized product $\prod_t(I - \beta_t k_t k_t^\top)$ at machine-epsilon scale ($\le 3.7 \times 10^{-16}$ for every plotted $C$); with wildly unstable rates ($\beta\|k\|^2 \gg 2$) the product's entries — and the absolute error with them — would grow exponentially, one more reason §12.4's regime is the operating point. Produced by companions/ch12/jax/chunkwise.py; pinned $< 10^{-12}$ for every plotted $C$ in tests/test_chunkwise.py.

The practical punchline mirrors Chapter 11’s: the recurrent form is the inference mode, the chunkwise form is the training mode, and a fused kernel is an engineering refinement of an identity that is already exact in pure JAX.

12.6 Gating the delta rule: Gated DeltaNet and Kimi Linear

Theorem 12.4 left one direction untouched: everything orthogonal to the current key sits at eigenvalue exactly $1$ . A delta-rule memory never spontaneously forgets — stale bindings persist until their key is revisited. Chapter 9’s selective SSMs had the complementary skill: a scalar decay $\discA_t$ that forgets everything, uniformly. Gated DeltaNet Yang et al. (2025) composes the two:

S_t \;=\; \gamma_t\,S_{t-1}\bigl(I - \beta_t k_t k_t^\top\bigr) \;+\; \beta_t\,v_t k_t^\top, \qquad \gamma_t \in (0, 1] .

The gate multiplies the erased state; the fresh write lands ungated. Both parents are exact limits, and the companion pins both reductions: $\gamma_t \equiv 1$ recovers plain DeltaNet to $1.8 \times 10^{-15}$ , and $\beta_t \equiv 0$ is pure exponential decay $S_t = \gamma_t S_{t-1}$ — Chapter 9’s scalar-decay contraction with the carry promoted to a matrix, the same promotion Chapter 11 made for accumulation (Theorem 11.2).

The division of labor is exact, not approximate. Store a pair $(k_A, v_A)$ , then perform $T$ gated-delta writes whose keys are orthogonal to $k_A$ : the erase factors never touch the $k_A$ direction, so the retrieval decays only through the gate,

\|S_T\,k_A\| \;=\; \Bigl(\textstyle\prod_{t=1}^{T}\gamma_t\Bigr)\,\|v_A\| ,

and the companion measures the product law at $5.6 \times 10^{-17}$ over forty writes. Gating forgets uniformly (a half-life for everything); the delta rule forgets selectively (exact overwrite of what is re-keyed). An architecture with both has independent control of the two timescales — which is why this update, not the plain delta rule, is what shipped. And the gate does not cost trainability: extending §12.5’s chunkwise/WY machinery to the $\gamma$ -decayed erase product is a core contribution of the Gated DeltaNet paper Yang et al. (2025) , so the gated update trains with the same hardware efficiency.

That shipping lineage is current production history, summarized at architecture level. Kimi Linear Kimi Team (2025) builds its KDA (Kimi Delta Attention) layer as a refinement of exactly this gated update — the scalar $\gamma_t$ replaced by per-channel diagonal gates, a finer-grained forgetting clock — and interleaves KDA with full attention at a 3:1 layer ratio, reporting up to a 75% KV-cache reduction at million-token contexts. As of mid-2026 this family — the gated delta rule adopted as the linear layer of public hybrid stacks, KDA in the Kimi Linear release — is the delta rule’s production form. This book derives the update those systems share; their layer-ratio and gate-granularity choices belong to Chapter 14’s hybrid design space, and the chapter makes no derived claims about KDA beyond the gated update above.

Step back to see the family whole. Chapter 9 proved selective SSMs and masked attention are one object (Theorem 9.5); Chapter 11 added the accumulation face; this chapter added the write-rule axis: accumulate ( $\beta$ -write only), erase-and-write (delta), decay-erase-write (gated delta) — each one discretization choice on the same recall flow. Two limits now bracket the design space: the pure-decay limit ( $\beta \to 0$ , an SSM forgetting on a slow clock) and the pure-write limit (attention-like immediate binding). Pilot B’s two-timescale benchmarks live exactly in that bracket — attention as the fast boundary layer, the decaying state as the slow manifold — and Chapters 14 and 16 build the architectures and the measurement protocol for it.

12.7 What’s next

This chapter closed Chapter 9’s “delta-rule lineage” promise and Chapter 11’s capacity hand-off: the write rule is now an optimizer step, and its stability theory is Chapters 5–6 applied to a new flow.

Two threads leave here open by design. Chapter 13 takes the lineage’s next generalization: RWKV-7’s generalized delta rule (transition $\mathrm{diag} +$ rank-one in a learned direction, not the key itself) and xLSTM’s matrix memory with exponential gating — the gate moved inside the nonlinearity, with its own stabilization problem. Chapter 14 mixes these layers with attention into the production hybrids named above, where the layer-ratio and gate-granularity decisions we deferred become the design variables; Chapter 16 then builds the evaluation methodology — including the two-timescale protocol pilot B contributes — that makes the comparisons honest.

12.8 Exercises

Three short problems (solutions inline) and three longer ones (solutions in §12.9).

Exercise 12.1 (short)

With $d_k = d_v = 2$ , unit key $k = (1, 0)^\top$ , and values $v_1 = (1, 2)^\top$ , $v_2 = (3, -1)^\top$ : compute the delta-rule state after writing $(k, v_1)$ then $(k, v_2)$ with $\beta = 1$ , and the additive state $\sum_i v_i k^\top$ for the same writes. Verify the retrievals $S k$ are $v_2$ and $v_1 + v_2$ respectively.

Solution

First write: $S_1 = 0 + 1\cdot(v_1 - 0)k^\top = v_1 k^\top = \bigl(\begin{smallmatrix}1 & 0\\ 2 & 0\end{smallmatrix}\bigr)$ . Second write: $S_1 k = v_1$ , so $S_2 = S_1 + (v_2 - v_1)k^\top = v_2 k^\top = \bigl(\begin{smallmatrix}3 & 0\\ -1 & 0\end{smallmatrix}\bigr)$ and $S_2 k = v_2$ : the old binding is gone. Additive: $S = v_1 k^\top + v_2 k^\top = (v_1 + v_2)k^\top$ , retrieving $v_1 + v_2 = (4, 1)^\top$ — the stale residue of §12.2’s measured demo, by hand.

Exercise 12.2 (short, code)

Run companions/ch12/jax/delta_rule.py. Report the overwrite residuals and the $K = 32$ recall errors, then explain in one sentence why the orthonormal-key panel shows both rules exact.

Solution

The script prints delta-rule overwrite residual $4.4 \times 10^{-16}$ against an additive stale residue of $1.27$ , and $K{=}32$ mean errors $2.0$ (additive) versus $1.44$ (delta). With orthonormal keys, every cross-term $k_i^\top k_j$ ( $i \ne j$ ) vanishes, so the additive state already retrieves exactly and the erase factors act as the identity on all stored directions — there is no interference for the delta rule to remove.

Exercise 12.3 (short)

For $\alpha = 1$ and $\|k\|^2 \in \{1, 4, 100\}$ , compute Longhorn’s $\beta^{\mathrm{eff}}$ , the product $\beta^{\mathrm{eff}}\|k\|^2$ , and the radius $\rho_k$ . Verify the complement identity $\rho_k + \beta^{\mathrm{eff}}\|k\|^2 = 1$ and state what it says about where the “stable mass” goes as keys grow.

Solution

$\beta^{\mathrm{eff}} = 1/(1 + \|k\|^2) = 1/2,\ 1/5,\ 1/101$ ; products $1/2,\ 4/5,\ 100/101$ ; radii $1/2,\ 1/5,\ 1/101$ . Each pair sums to $1$ because $\rho_k = 1 - \beta^{\mathrm{eff}}\|k\|^2$ exactly (the eigenvalue is positive here). As $\|k\|^2/\alpha \to \infty$ the write share approaches $1$ and the radius approaches $0$ : large keys are written harder and contracted faster — the L-stable behavior of Proposition 6.1‘s backward Euler, in optimizer clothing.

Exercise 12.4 (theory) — solution in §12.9

Derive Theorem 12.3: starting from the stationarity equation $\alpha(S_t - S_{t-1}) + (S_t k - v)k^\top = 0$ , obtain the closed form by solving for the post-update prediction $S_t k$ first. Where exactly does $\alpha > 0$ get used?

Exercise 12.5 (theory) — solution in §12.9

Prove Theorem 12.4: compute the full spectrum of $I - \beta^{\mathrm{eff}} kk^\top$ , derive the deviation recurrence $E_t = E_{t-1}(I - \beta^{\mathrm{eff}} kk^\top)$ from either update, and conclude the stability characterizations for both step-size rules. Explain why starting from $S_0 = 0$ makes $\|E_t\|_F$ exactly geometric rather than merely asymptotically so.

Exercise 12.6 (theory) — solution in §12.9

Prove Theorem 12.5 by induction on the chunk position, and count the cost of applying $P$ to a $d_v \times d_k$ state via the WY form versus the materialized product, for chunk size $C \ll d_k$ .

12.9 Full solutions to theory exercises

Solution to Exercise 12.4

Right-multiply the stationarity equation by $k$ :

\alpha\,(S_t k - S_{t-1}k) + (S_t k - v)\,\|k\|^2 = 0 \;\Longrightarrow\; (\alpha + \|k\|^2)\,S_t k = \alpha\,S_{t-1}k + \|k\|^2\,v ,

a scalar-shaped linear equation for the vector $p := S_t k$ (the matrix unknown has collapsed because only $S_t k$ appears). Since $\alpha > 0$ the coefficient $\alpha + \|k\|^2$ is strictly positive — this is the only place positivity is needed, and it is what makes the implicit solve nonsingular for every key, including $k = 0$ . Solve:

S_t k \;=\; \frac{\alpha\,S_{t-1}k + \|k\|^2 v}{\alpha + \|k\|^2}, \qquad v - S_t k \;=\; \frac{\alpha\,(v - S_{t-1}k)}{\alpha + \|k\|^2}.

Substitute into the stationarity equation rearranged as $S_t = S_{t-1} + \tfrac{1}{\alpha}(v - S_t k)k^\top$ :

S_t \;=\; S_{t-1} + \frac{1}{\alpha}\cdot\frac{\alpha\,(v - S_{t-1}k)}{\alpha + \|k\|^2}\,k^\top \;=\; S_{t-1} + \frac{v - S_{t-1}k}{\alpha + \|k\|^2}\,k^\top . \qquad\blacksquare

Solution to Exercise 12.5

Spectrum. For any unit vector $u \perp k$ : $(I - \beta^{\mathrm{eff}} kk^\top)u = u$ , giving eigenvalue $1$ with multiplicity $d_k - 1$ . For $u = k/\|k\|$ : $(I - \beta^{\mathrm{eff}} kk^\top)k = (1 - \beta^{\mathrm{eff}} \|k\|^2)k$ , the single non-unit eigenvalue.

Deviation recurrence. Both updates have the form $S_t = S_{t-1} + \beta^{\mathrm{eff}}(v - S_{t-1}k)k^\top$ (Theorem 12.3 for Longhorn). Since $S^\star k = v$ ,

E_t = S_t - S^\star = E_{t-1} + \beta^{\mathrm{eff}}\bigl((v - S^\star k) - E_{t-1}k\bigr)k^\top = E_{t-1}\bigl(I - \beta^{\mathrm{eff}} kk^\top\bigr).

The affine parts cancel exactly — no linearization, no remainder.

Characterizations. Deviation components along $k^\perp$ (in the row space) are multiplied by $1$ each step; the component along $k$ is multiplied by $\lambda = 1 - \beta^{\mathrm{eff}}\|k\|^2$ . Convergence of the corrected direction requires $|\lambda| < 1$ , i.e. $\beta^{\mathrm{eff}}\|k\|^2 \in (0, 2)$ . DeltaNet’s free $\beta$ can violate this; Longhorn’s $\beta^{\mathrm{eff}}\|k\|^2 = \|k\|^2/(\alpha + \|k\|^2) \in (0, 1) \subset (0, 2)$ cannot, and $\lambda = \alpha/(\alpha + \|k\|^2) \in (0, 1)$ directly.

Exact geometry from $S_0 = 0$ . Then $E_0 = -S^\star = -vk^\top/\|k\|^2$ , whose every row is a multiple of $k^\top$ : the deviation starts entirely in the contracted direction, with no component on the eigenvalue- $1$ complement. Hence $\|E_t\|_F = |\lambda|^t\,\|E_0\|_F$ exactly — which is why the measured per-step ratios in §12.4 reproduce $\rho$ to twelve digits rather than only in the limit. $\qquad\blacksquare$

Solution to Exercise 12.6

Induction. Base $t = 0$ : the empty product is $I = I - 0$ , with empty factors. Step: assume $P_{1:t-1} = I - W_{1:t-1}^\top Y_{1:t-1}$ where the rows of $Y_{1:t-1}$ are $k_1, \dots, k_{t-1}$ . Then

P_{1:t} = P_{1:t-1}\bigl(I - \beta_t k_t k_t^\top\bigr) = P_{1:t-1} - \beta_t\,\bigl(P_{1:t-1}k_t\bigr)k_t^\top .

The correction is rank one with row direction $k_t^\top$ , so append $y_t = k_t$ and

w_t = \beta_t\,P_{1:t-1}k_t = \beta_t\Bigl(k_t - W_{1:t-1}^\top Y_{1:t-1} k_t\Bigr) = \beta_t\Bigl(k_t - \sum_{j<t}(k_j^\top k_t)\,w_j\Bigr),

which is the stated recursion, and $P_{1:t} = I - W_{1:t}^\top Y_{1:t}$ .

Cost. WY application $SP = S - (SW^\top)Y$ : one $(d_v \times d_k)(d_k \times C)$ GEMM and one $(d_v \times C)(C \times d_k)$ GEMM, $O(d_v d_k C)$ total, plus the $O(C^2 d_k)$ one-time factor build. The materialized product costs $O(C\,d_k^2)$ to build and $O(d_v d_k^2)$ to apply: for $C \ll d_k$ the WY route wins by a factor $\sim d_k/C$ on both counts, and — the practical point — both of its operations are GEMMs, the shape hardware wants. $\qquad\blacksquare$

12.10 Companion code

The companions live in companions/ch12/{jax,torch,julia} and are float64 throughout. The JAX modules are canonical and produce all four figures; the torch modules mirror the three operator updates with cross-framework parity pinned to $< 10^{-9}$ ; the Julia module ports the §12.4 stability analysis to a third language using only the standard library, pinning the same closed forms and trajectory ratios.

# JAX (canonical; emits the four figures to public/figures/ch12/)
PYTHONPATH=. python companions/ch12/jax/delta_rule.py
PYTHONPATH=. python companions/ch12/jax/longhorn.py
PYTHONPATH=. python companions/ch12/jax/stability.py
PYTHONPATH=. python companions/ch12/jax/chunkwise.py
PYTHONPATH=. python companions/ch12/jax/gated_delta.py

# Tests: JAX identities (< 1e-12), torch parity (< 1e-9), Julia stability suite
.venv/bin/pytest companions/ch12/jax companions/ch12/torch -q
julia --project=companions/ch12/julia companions/ch12/julia/runtests.jl

delta_rule.py — the explicit step, its two algebraic faces, the fixed point, overwrite-vs-accumulate semantics, and the recall comparison figure (§12.1–12.2).
longhorn.py — the implicit step via its closed form, verified against an independent dense solve of the stationarity system, plus the structural identity with the delta rule and the exact geometric error trajectories (§12.3–12.4).
stability.py — the closed-form spectral radii, the named boundary, and the Rayleigh-quotient drift guard behind the stability-regions figure (§12.4).
chunkwise.py — the WY representation, its two-GEMM application, and the chunkwise≡recurrent certificate (§12.5).
gated_delta.py — the gated update, both exact reductions, and the $\gamma^T$ uniform-forgetting law (§12.6). The torch mirror’s test file also documents the buffers-vs-Parameters distinction (a learned rate projection is a Parameter; a fixed rate floor is a buffer).

Part III · Beyond SSMs Week 13 Published

Exponential gates and matrix memory: xLSTM and RWKV-7

One generalized linear recurrence whose transition is diagonal-plus-rank-one — RWKV-7's generalized delta rule (rank-one removal in a learned direction, not the key) and xLSTM's matrix memory with exponential gating (and the log-domain stabilizer state that overflow forces) — extending Chapter 12's delta-rule lineage to matrix memories with their own stability questions.

Exponential gates and matrix memory: xLSTM and RWKV-7

Chapter 13 — at a glance

Goal: take Chapter 12’s delta-rule lineage one generalization further. Both of this chapter’s architectures keep the linear recurrence $S_t = S_{t-1} A_t + \text{write}_t$ but free the transition $A_t$ from the delta rule’s rigid $\gamma_t(I - \beta_t k_t k_t^\top)$ to a full diagonal-plus-rank-one form: RWKV-7’s generalized delta rule removes along a learned direction rather than the key, and xLSTM’s mLSTM carries a matrix memory written with exponential gates — gates moved inside the nonlinearity, which buys expressivity at the price of a numerical stability problem the architecture must solve in its own state.

Reading time: ~45 minutes prose; ~2 hours with the companions.

Key insight: moving the gate inside the exponential imports a stability question into the state. The book’s stability thread — bounded states (Chapter 2), the explicit/implicit dichotomy (Chapters 5–6), the delta rule’s finite stability interval (Chapter 12) — returns here in two registers: as the spectrum of the generalized transition (when does $\mathrm{Diag}(w) - c\,aa^\top$ contract?), a question about the state’s trajectory; and as the overflow of the exponential input gate (xLSTM’s log-domain max-state is the numerical shadow of an implicit method’s unconditional guarantee — stability of the computation where Chapter 6 gave stability of the dynamics).

Looking ahead: Chapter 13 is not a pilot anchor — C1’s prerequisites closed at Chapter 10 and B’s at Chapter 16. It completes the beyond-SSM lineage and sets up Chapter 15, whose diagnostic tools (Lyapunov exponents, regime detection) probe exactly the matrix-memory stability questions raised here, on trained models rather than at the architecture level.

13.1 The move: the gate inside the nonlinearity

Chapter 12 read a sequence layer’s state update as one step of an optimizer and organized the delta-rule family around a single linear recurrence,

S_t \;=\; S_{t-1}\,A_t \;+\; \text{write}_t ,

with the state $S \in \mathbb{R}^{d_v \times d_k}$ retrieved by $S q$ and the transition acting on the right. Every architecture in that chapter shared one transition shape: $A_t = \gamma_t\,(I - \beta_t k_t k_t^\top)$ — a scalar decay $\gamma_t$ times the identity, minus a rank-one term locked to the write key $k_t$ . Gated DeltaNet’s $\gamma_t$ was the only freedom beyond the delta rule, and it was a single number per step (Proposition 14.3 measures what that granularity costs).

Two threads were left open. Chapter 12 ended at the capacity wall (Proposition 11.4): a rank- $d_k$ state holds at most $d_k$ independent associations, and the delta rule erases interference without raising that ceiling. And its transition forgot only in the write direction — the orthogonal complement was inert, persistent memory that gradient flow on the current pair never cleans up. This chapter’s two architectures attack both by generalizing the transition itself, in two independent generalizations of the same linear recurrence:

RWKV-7 Peng et al. (2025) keeps the rank-one shape but frees the removal direction and the diagonal: the erase direction need not be the key, and the decay is per-channel, not scalar. This is its “generalized delta rule,” and Chapter 12’s whole lineage falls out as a special case.
xLSTM Beck et al. (2024) keeps a matrix memory $C_t$ and writes it with exponential gates — the input gate is $\exp(\tilde i_t)$ , unbounded — which requires a stabilizer state to keep the recurrence inside float range.

The unifying object is the generalized transition

A_t \;=\; \mathrm{Diag}(w_t) \;-\; c_t\, a_t a_t^\top, \qquad \|a_t\| = 1,\ c_t \ge 0 ,

a diagonal decay $w_t$ (replacing Chapter 12’s scalar $\gamma_t I$ ) minus a rank-one removal in a unit direction $a_t$ (replacing the key-locked $k_t$ ). The rest of the chapter is two questions about this one matrix — when does it contract (§13.2), and what does freeing $a_t$ buy (§13.3) — followed by xLSTM’s exponential-gate variant and the stabilizer it forces (§13.4).

13.2 The generalized transition and its spectrum

Whether the recurrence $S_t = S_{t-1} A_t + \text{write}_t$ keeps a bounded state is, as in Chapter 12, a question about the eigenvalues of the transition. The generalized $A_t$ is built from a diagonal and a symmetric rank-one term, so it is itself symmetric — and therefore has a real spectrum we can read exactly.

Proposition 13.1 (Spectrum of the diagonal-plus-rank-one transition).

Let $A = \mathrm{Diag}(w) - c\,a a^\top$ with $a \in \mathbb{R}^d$ a unit vector and $c \ge 0$ . Then $A$ is symmetric, and:

(Secular equation, interlacing.) Its eigenvalues are the $w_i$ for which $a_i = 0$ , together with the roots $\lambda \notin \{w_i\}$ of $f(\lambda) \;=\; 1 - c\sum_{i} \frac{a_i^2}{w_i - \lambda} \;=\; 0 .$ The rank-one downdate pushes every eigenvalue down without crossing the next diagonal entry, so the spectrum interlaces the sorted $w_i$ and lies in $[\,\min_i w_i - c,\ \max_i w_i\,]$ .
(Scalar-diagonal case.) If $w = w_0\mathbf 1$ , exactly one eigenvalue moves: the spectrum is $\{w_0 - c\}\cup\{w_0\}^{d-1}$ , with spectral radius $\rho(A) = \max(|w_0|,\ |w_0 - c|)$ . At $w_0 = 1$ this is $\max(1, |1 - c|)$ , recovering DeltaNet’s $k$ -direction radius $|1 - \beta\|k\|^2|$ (Theorem 12.4) under the identification $c = \beta\|k\|^2$ — and the contractivity boundary $c = 2$ is exactly forward Euler’s.

The proof is the matrix-determinant lemma: $\det(A - \lambda I) = \prod_i(w_i - \lambda)\cdot f(\lambda)$ , so the roots of $f$ are the eigenvalues away from the diagonal entries; the scalar case is the direct eigendecomposition of $w_0 I - c\,aa^\top$ , whose only non- $w_0$ eigenvalue lies along $a$ (Exercise 13.4 carries the derivation). The companion reads the spectrum two ways — eigvalsh of the materialized $A$ against the scalar-diagonal closed form — and they agree to a measured $6.7 \times 10^{-16}$ ; the secular residual at the general-diagonal eigenvalues is $3.2 \times 10^{-12}$ .

One reading of Proposition 13.1 matters for everything that follows. With a scalar diagonal at $w_0 = 1$ , the radius is $\max(1, |1 - c|) \ge 1$ : the transition is never strictly contractive, because the $d - 1$ directions orthogonal to $a$ sit at eigenvalue exactly $1$ — Chapter 12’s inert persistent memory, unchanged. Strict contraction of those directions is what the diagonal $w$ buys: entries $w_i < 1$ forget per-channel, the generalization of Gated DeltaNet’s single $\gamma_t$ to a vector. Freeing the diagonal and freeing the removal direction are the two independent degrees of freedom RWKV-7 adds, and the spectrum shows they do different jobs — decay versus erasure.

Two panels. Left: for a scalar diagonal at w0 = 1, two eigenvalue curves versus the removal coefficient c — a flat line at 1 (the persistent directions) and a descending line 1 minus c that crosses below minus 1 at c = 2, where the unstable region is shaded and annotated as Chapter 12's boundary. Right: for a general six-entry diagonal, six eigenvalue trajectories all sliding downward as c grows, each staying between dotted lines marking the diagonal entries, illustrating interlacing. — The generalized transition's spectrum, from Proposition 13.1. Left: scalar diagonal $w_0 = 1$ — one eigenvalue slides to $w_0 - c = 1 - c$ and crosses $-1$ at $c = 2$, exactly DeltaNet's $\beta\|k\|^2 = 2$ boundary; the other $d-1$ eigenvalues stay pinned at $1$ (persistent memory). Right: a general six-entry diagonal — the rank-one downdate pushes the whole spectrum down, every eigenvalue interlacing the diagonal entries (dotted) and confined to $[\min_i w_i - c, \max_i w_i]$. Produced by companions/ch13/jax/generalized_transition.py; eigvalsh matches the scalar-diagonal closed form to a measured $6.7\times10^{-16}$ (pinned $<10^{-13}$) and the secular function zeroes at the eigenvalues to a measured $3.2\times10^{-12}$, pinned in tests/test_generalized_transition.py.

13.3 RWKV-7’s generalized delta rule

RWKV-7 Peng et al. (2025) runs the generalized transition over a stream,

S_t \;=\; S_{t-1}\bigl(\mathrm{Diag}(w_t) - c_t\, a_t a_t^\top\bigr) \;+\; u_t b_t^\top ,

with a per-channel decay $w_t$ , a unit removal direction $a_t$ , a removal coefficient $c_t$ , and a write $u_t b_t^\top$ whose key $b_t$ is decoupled from $a_t$ . The paper’s own framing is “a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates” — and the precise sense in which it generalizes the delta rule is a one-line reduction.

Proposition 13.2 (The generalized rule contains the delta-rule lineage).

Set $w_t = \gamma_t\mathbf 1$ (scalar diagonal), $a_t = k_t/\|k_t\|$ (removal direction = unit key), $c_t = \gamma_t\,\beta_t\,\|k_t\|^2$ , and write $u_t b_t^\top = \beta_t\, v_t k_t^\top$ . Then the generalized recurrence equals Gated DeltaNet’s update

S_t \;=\; \gamma_t\, S_{t-1}\,(I - \beta_t k_t k_t^\top) \;+\; \beta_t\, v_t k_t^\top

identically. Hence plain DeltaNet ( $\gamma_t \equiv 1$ ) and the entire Chapter 12 lineage are the scalar-diagonal, removal-locked-to-the-key special case; RWKV-7’s added freedom is precisely the per-channel diagonal $w_t$ and the learned direction $a_t \ne k_t$ .

The proof substitutes and matches: $c_t a_t a_t^\top = \gamma_t\beta_t\|k_t\|^2 \cdot \hat k_t \hat k_t^\top = \gamma_t\beta_t\, k_t k_t^\top$ , and the scalar diagonal contributes the $\gamma_t I$ , so $\mathrm{Diag}(w_t) - c_t a_t a_t^\top = \gamma_t(I - \beta_t k_t k_t^\top)$ (Exercise 13.5). The companion runs the generalized recurrence with these parameters against Chapter 12’s gated_delta_recurrent and pins the difference at $8.9 \times 10^{-16}$ ; the rank-one lax.scan form matches the materialized-transition Python-loop oracle at $5.3 \times 10^{-15}$ .

What does the learned direction buy, concretely? In Chapter 12 the only way to remove the association stored on a key $k_A$ was to write again with key $k_A$ — erasure was welded to the write. RWKV-7 can aim its rank-one removal at $k_A$ while writing some other, orthogonal pair, evicting the old association without overwriting it. Store $(k_A, v_A)$ on a unit key, then perform $T$ writes whose keys are orthogonal to $k_A$ :

with the removal aimed at $k_A$ (direction $a_t = k_A$ , coefficient $c$ ), the retrieval decays as exactly $\|S_T k_A\| = (1 - c)^T\,\|v_A\|$ — eviction with no overwrite;
with the removal locked to the write key (Chapter 12’s $a_t = b_t \perp k_A$ ), the orthogonal writes never touch $k_A$ and $\|S_T k_A\| = \|v_A\|$ — flat, the same inert-complement fact from §13.2.

The companion measures the decay at $c = 0.1$ , $T = 30$ as $\|S_T k_A\| = 0.028576$ , matching $(1-c)^T\|v_A\|$ to $1.7 \times 10^{-17}$ , against a flat locked baseline of $\|v_A\| = 0.674096$ . Decoupling the erase direction from the key is the whole content of “generalized” — and it is what lets RWKV-7 track state, not merely store associations.

A single plot of retrieval norm of an old key versus number of orthogonal writes T, from 0 to 90. One curve labeled Chapter 12 locked stays flat at about 0.67. A second set of points labeled RWKV-7 targeted decays smoothly toward zero, overlaid with a dashed analytic curve (1 minus c) to the T power times the value norm at c equals 0.1. — What freeing the removal direction buys. Storing $(k_A, v_A)$ then performing $T$ writes orthogonal to $k_A$: RWKV-7's *targeted* removal ($a_t = k_A$) evicts the old association as exactly $(1-c)^T\|v_A\|$ ($c = 0.1$; measured $0.028576$ at $T = 30$, matching the closed form to $1.7\times10^{-17}$), while Chapter 12's *key-locked* removal leaves it flat at $\|v_A\| = 0.674096$ — orthogonal writes cannot touch it. Produced by companions/ch13/jax/generalized_transition.py; pinned in tests/test_generalized_transition.py.

13.4 xLSTM: matrix memory and exponential gating

xLSTM Beck et al. (2024) arrives at the same matrix recurrence from the LSTM side rather than the delta-rule side. Its mLSTM cell carries a matrix memory $C_t \in \mathbb{R}^{d_v \times d_k}$ and a normalizer vector $n_t$ ,

C_t = f_t\,C_{t-1} + i_t\,v_t k_t^\top, \qquad n_t = f_t\,n_{t-1} + i_t\,k_t, \qquad h_t = \frac{C_t q_t}{\max(|n_t^\top q_t|,\ 1)} ,

written with scalar gates the architecture moves inside an exponential: the forget gate $f_t = \sigma(\tilde f_t) \in (0, 1]$ is bounded, but the input gate $i_t = \exp(\tilde i_t)$ is not. An unbounded input gate is the deliberate move: it lets a single high-surprise token’s write dominate the entire running memory — a sharper, winner-take-more selectivity than any sigmoid can express. (xLSTM’s other cell, sLSTM, applies the same exponential gates to a scalar memory with a memory-mixing recurrence; the matrix mLSTM is this chapter’s subject.) That expressivity is exactly the source of the problem: $\exp(\tilde i_t)$ overflows float64 once $\tilde i_t \gtrsim 709$ , and the naive recurrence then produces inf/nan.

The fix is a stabilizer state, a running maximum carried in the log domain:

m_t \;=\; \max\bigl(\log f_t + m_{t-1},\ \log i_t\bigr), \qquad m_0 = -\infty ,

with which the gates are rescaled and the memory is carried in scaled coordinates $C_t \exp(-m_t)$ , $n_t \exp(-m_t)$ . This is the online-softmax running-max trick, and it is exact:

Proposition 13.3 (The exponential-gate stabilizer is exact, not approximate).

Let $\bar C_t, \bar n_t$ be the naive mLSTM states with raw gates $\bar f_t = \exp(\log f_t)$ , $\bar i_t = \exp(\log i_t)$ , and define the stabilized states $C_t = e^{-m_t}\bar C_t$ , $n_t = e^{-m_t}\bar n_t$ via the max-state $m_t$ above. Then:

(Range.) The rescaled gates $f'_t = e^{\log f_t + m_{t-1} - m_t}$ and $i'_t = e^{\log i_t - m_t}$ lie in $(0, 1]$ (because $m_t \ge \log f_t + m_{t-1}$ and $m_t \ge \log i_t$ by definition of the max), so the scaled recurrence $C_t = f'_t C_{t-1} + i'_t v_t k_t^\top$ never overflows.
(Exactness.) The readout is invariant under the rescaling: $\frac{\bar C_t q_t}{\max(|\bar n_t^\top q_t|,\ 1)} \;=\; \frac{C_t q_t}{\max(|n_t^\top q_t|,\ e^{-m_t})} .$ The only change is that the floor $1$ becomes $e^{-m_t}$ . Wherever the naive recurrence is finite the two agree to machine precision; where it overflows, the stabilized one stays finite and still correct.

Part 1 is the definition of the max; part 2 factors $e^{m_t}$ out of numerator and denominator (Exercise 13.6). The one place the rescaling is visible is the readout floor — the constant $1$ becomes $e^{-m_t}$ — and getting that scaling right is exactly what makes the stabilizer a change of variables rather than a numerical approximation: it buys range and costs nothing in the answer. The companion confirms both halves: in the safe regime the stabilized and naive readouts agree to $3.1 \times 10^{-15}$ ; pushed to an input log-gate of $700$ , the naive memory reaches a peak entry of $5.9 \times 10^{303}$ — finite, but a hair from the ceiling — and the readouts still agree to $8.9 \times 10^{-16}$ ; at $760$ the naive recurrence overflows ( $50\%$ of its readout entries become nan), while the stabilized one stays bounded (peak entry $2.5$ ) and exact.

The cleanest statement of what the stabilizer guarantees is a recovery property. Store one pair $(k, v)$ on a unit key with input log-gate $\tilde i$ , then read at $q = k$ : the stabilized readout is exactly $v$ , for any $\tilde i$ . With one write, $C_1 = i'_1\, v k^\top$ and $n_1 = i'_1\, k$ with $i'_1 = 1$ , so $h_1 = v\,\|k\|^2 / \max(\|k\|^2, e^{-m_1}) = v$ — the normalizer cancels the gate, which is what the exponential input gate is for. The companion stores the fixed pair $v = [0.3, -0.7, 1.1]$ and recovers it to $1.1 \times 10^{-16}$ even at $\tilde i = 800$ , where the naive recurrence has long since overflowed; the Julia companion recovers the identical value, a cross-language anchor on the stabilized readout.

This is the book’s stability thread reappearing inside the architecture. Chapter 2 asked when a state stays bounded; Chapters 5–6 answered it for integrators with the explicit/implicit dichotomy, and Chapter 6’s implicit methods bought unconditional stability by solving rather than extrapolating (Proposition 6.1). xLSTM’s stabilizer is the same bargain made numerically: it does not change which mathematical object the recurrence computes, it changes the coordinates so the computation stays in range — an unconditional guarantee that the matrix memory never overflows, paid for with one extra scalar of state.

Two panels. Left: log base 10 of the peak memory entry versus the peak input log-gate, on a linear axis; the naive curve rises linearly and stops at a vertical line where it overflows near 710, with a dashed horizontal line at the float64 ceiling exponent of about 308, while the stabilized curve stays flat near zero. Right: the maximum readout gap between naive and stabilized versus the input log-gate on a log axis, sitting near 1e-15 below the 1e-12 pin line until a shaded region where the naive output becomes nan. — The exponential input gate and its stabilizer (Proposition 13.3). Left: the naive matrix memory's peak entry grows like $\exp(\tilde i)$ and overflows float64 at $\tilde i \approx 710$ (vertical line; ceiling $\sim 10^{308}$ dashed), while the stabilized memory stays $O(1)$. Right: where the naive recurrence survives the two readouts agree to $\sim 10^{-15}$ (the change-of-variables exactness), and past the overflow cliff the naive output is `nan` while the stabilized one remains exact. Produced by companions/ch13/jax/xlstm.py; the safe-regime gap is pinned $<10^{-12}$ and the overflow behavior in tests/test_xlstm.py.

13.5 Stability, the trigger taxonomy, and the production lineup

Both architectures put a stability question inside the state — but in two different registers, and the difference is the point. RWKV-7’s is dynamical: the generalized transition contracts only when the diagonal decays and the rank-one removal stays within the boundary Proposition 13.1 draws — the explicit-method story from Chapter 12, now per-channel, a statement about the trajectory of the state. xLSTM’s is numerical: the exponential gate would overflow, and the max-state stabilizer is an in-state guarantee that it does not — not a claim about the dynamics at all, but about keeping the computation in float range, the implicit method’s unconditional-stability bargain made about representation rather than about the step. Chapter 15 returns to both with diagnostic tools (Lyapunov exponents, regime detection) that measure these stabilities on trained networks rather than asserting them at the architecture level.

There is a taxonomy worth closing here. Chapter 14 organized gates by what triggers a memory write — the input (content), the output, or a fixed layer schedule (Proposition 14.3). xLSTM’s exponential input gate is the sharpest input-triggered write in production. The remaining class is triggered by neither input nor position but by the model’s own loss: the Titans line Behrouz et al. (2025) writes to a neural long-term memory at test time, taking a gradient step on a surprise objective as it reads — a memory whose update rule is itself learned and fired by prediction error. That makes it a fourth trigger class (the loss), and it lives on a matrix memory of exactly the kind this chapter studies; its test-time-training mechanics are a distinct enough subject that this book names and places the trigger class here but does not pursue the test-time-training machinery further.

In production as of mid-2026 these are not toys: xLSTM has been trained at the $7$ B-parameter scale and RWKV-7 (“Goose”) at $2.9$ B, the latter reaching state-of-the-art multilingual performance at its size on dramatically fewer training tokens than comparable models Peng et al. (2025) . The diagonal-plus-rank-one transition and the exponential-gate stabilizer are not pedagogical simplifications — they are what these models run.

13.6 What’s next

Chapter 14 has already mixed these layers with attention into production hybrids, where the layer-ratio and gate-granularity decisions become design variables; Chapter 16 built the methodology that measures them honestly. The natural sequel to this chapter is Chapter 15, the counter-evidence file: it takes the stability questions raised here — when does a matrix memory’s recurrence stay bounded, and what can it provably not do — and turns them into diagnostics (Lyapunov exponents, effective state size) and impossibility results (the $\mathsf{TC}^0$ ceiling RWKV-7’s state-tracking claim brushes against). The generalized transition and the stabilizer state are the objects those diagnostics will probe.

13.7 Exercises

Exercise 13.1 (short)

For the scalar-diagonal transition $A = I - c\,aa^\top$ (so $w_0 = 1$ , $a$ unit), use Proposition 13.1 to give the spectral radius $\rho(A)$ at $c = 0.5$ , $c = 1.0$ , and $c = 2.5$ . Which is non-contractive, and why is $\rho(A) \ge 1$ for every $c$ at this diagonal?

Solution

The spectrum is $\{1 - c\}\cup\{1\}^{d-1}$ , so $\rho = \max(1, |1 - c|)$ . At $c = 0.5$ : $\max(1, 0.5) = 1$ . At $c = 1.0$ : $\max(1, 0) = 1$ . At $c = 2.5$ : $\max(1, 1.5) = 1.5$ — non-contractive (the iteration diverges, exactly forward Euler past its boundary). The radius is $\ge 1$ for every $c$ because the $d-1$ directions orthogonal to $a$ sit at eigenvalue $1$ : with the diagonal pinned at $w_0 = 1$ there is no per-channel decay, so the rank-one removal can shrink only the single direction $a$ . Strict contraction requires diagonal entries $w_i < 1$ — the role of RWKV-7’s vector-valued $w_t$ .

Exercise 13.2 (short, code)

Run companions/ch13/jax/generalized_transition.py. Report the P3 reduction error (generalized rule vs Chapter 12’s gated DeltaNet) and the decoupled-eviction measurement at $c = 0.1$ , $T = 30$ : the targeted retrieval norm and the locked baseline. What does the gap between them demonstrate?

Solution

The reduction error is $8.9 \times 10^{-16}$ — the generalized recurrence reproduces gated DeltaNet to machine precision, confirming Proposition 13.2. The eviction run reports targeted $\|S_T k_A\| = 0.028576$ (matching $(1-0.1)^{30}\|v_A\|$ to $1.7\times10^{-17}$ ) against a locked baseline of $0.674096 = \|v_A\|$ . The gap demonstrates the payoff of the learned direction: RWKV-7 can evict an old key by aiming its removal at it, while Chapter 12’s key-locked removal — under writes orthogonal to $k_A$ — leaves the association untouched. Decoupling the erase direction from the key is what “generalized” means.

Exercise 13.3 (short, code)

Run companions/ch13/jax/xlstm.py. Report the safe-regime readout gap between the naive and stabilized recurrences, and the single-pair recovery error at input log-gate $800$ . Why is the recovery error not affected by the gate’s magnitude?

Solution

The safe-regime readout gap is $3.1 \times 10^{-15}$ (the change-of-variables exactness of Proposition 13.3), and single-pair recovery at $\tilde i = 800$ returns the stored value to $1.1 \times 10^{-16}$ — even though $\exp(800)$ overflows float64 and the naive recurrence is nan there. The magnitude does not matter because the readout is a ratio: a single write makes $C_1 = i'_1 v k^\top$ and $n_1 = i'_1 k$ with the same rescaled gate $i'_1 = 1$ , and reading at $q = k$ gives $C_1 k / \max(|n_1^\top k|, e^{-m_1}) = v$ . The normalizer cancels the input gate exactly — which is precisely what an unbounded input gate needs in order to be usable.

Exercise 13.4 (theory) — solution in §13.8

Prove part 1 of Proposition 13.1. Using the matrix- determinant lemma $\det(M + uv^\top) = \det(M)(1 + v^\top M^{-1} u)$ , derive the secular equation $f(\lambda) = 1 - c\sum_i a_i^2/(w_i - \lambda) = 0$ for the eigenvalues of $\mathrm{Diag}(w) - c\,aa^\top$ away from the diagonal entries, and specialize to $w = w_0\mathbf 1$ to recover the scalar-diagonal spectrum.

Exercise 13.5 (theory) — solution in §13.8

Prove Proposition 13.2. Substitute $w_t = \gamma_t\mathbf 1$ , $a_t = k_t/\|k_t\|$ , $c_t = \gamma_t\beta_t\|k_t\|^2$ , $u_t b_t^\top = \beta_t v_t k_t^\top$ into the generalized recurrence and show term-by-term that it equals Gated DeltaNet’s update. Identify which special case recovers plain DeltaNet, and which two degrees of freedom RWKV-7 adds over it.

Exercise 13.6 (theory) — solution in §13.8

Prove Proposition 13.3. (a) Show $f'_t, i'_t \in (0, 1]$ from the definition of $m_t$ . (b) By induction, show $C_t = e^{-m_t}\bar C_t$ and $n_t = e^{-m_t}\bar n_t$ where $\bar C_t, \bar n_t$ are the raw-gate states. (c) Conclude the readout identity, and explain why the constant floor $1$ must become $e^{-m_t}$ rather than staying $1$ .

13.8 Full solutions to theory exercises

Solution to Exercise 13.4

Write $A = \mathrm{Diag}(w) - c\,a a^\top$ and fix $\lambda \notin \{w_i\}$ , so $\mathrm{Diag}(w) - \lambda I$ is invertible. Apply the matrix-determinant lemma with $M = \mathrm{Diag}(w) - \lambda I$ , $u = a$ , $v = -c\,a$ :

\det(A - \lambda I) = \det(M)\,\bigl(1 - c\,a^\top M^{-1} a\bigr) = \prod_i (w_i - \lambda)\,\Bigl(1 - c\sum_i \frac{a_i^2}{w_i - \lambda}\Bigr) .

Since $\prod_i(w_i - \lambda) \ne 0$ for $\lambda \notin \{w_i\}$ , the eigenvalues there are exactly the roots of $f(\lambda) = 1 - c\sum_i a_i^2/(w_i - \lambda)$ . (A diagonal entry $w_j$ is itself an eigenvalue iff $a_j = 0$ , with eigenvector $e_j$ ; then the $j$ -term drops out of $f$ .) On each interval between consecutive distinct $w_i$ , $f$ is continuous and strictly monotone with poles of opposite sign at the endpoints, giving exactly one root per gap — the interlacing — and because $c \ge 0$ shifts $f$ so that all roots lie at or below the corresponding $w_i$ , the spectrum sits in $[\min_i w_i - c, \max_i w_i]$ .

For $w = w_0\mathbf 1$ : $\mathrm{Diag}(w) - c aa^\top = w_0 I - c\,aa^\top$ . Any vector orthogonal to $a$ is an eigenvector with eigenvalue $w_0$ (the rank-one term annihilates it), giving multiplicity $d - 1$ ; and $a$ itself is an eigenvector, $(w_0 I - c aa^\top)a = (w_0 - c)a$ since $\|a\| = 1$ . So the spectrum is $\{w_0 - c\}\cup\{w_0\}^{d-1}$ and $\rho = \max(|w_0|, |w_0 - c|)$ . At $w_0 = 1$ , $\rho = \max(1, |1 - c|)$ , equal to $|1 - c|$ once $c \ge 0$ pushes the moving eigenvalue below $1$ ; with $c = \beta\|k\|^2$ this is DeltaNet’s $|1 - \beta\|k\|^2|$ in the write direction, and $\rho > 1$ exactly when $c > 2$ . $\blacksquare$

Solution to Exercise 13.5

The generalized recurrence is $S_t = S_{t-1}(\mathrm{Diag}(w_t) - c_t a_t a_t^\top) + u_t b_t^\top$ . Substitute the four assignments. The diagonal term is $S_{t-1}\,\mathrm{Diag}(\gamma_t\mathbf 1) = \gamma_t S_{t-1}$ . The rank-one term is

c_t\, a_t a_t^\top = \gamma_t\beta_t\|k_t\|^2 \cdot \frac{k_t}{\|k_t\|}\frac{k_t^\top}{\|k_t\|} = \gamma_t\beta_t\, k_t k_t^\top ,

so $S_{t-1}(\mathrm{Diag}(w_t) - c_t a_t a_t^\top) = \gamma_t S_{t-1} - \gamma_t\beta_t\, S_{t-1} k_t k_t^\top = \gamma_t S_{t-1}(I - \beta_t k_t k_t^\top)$ . The write term is $u_t b_t^\top = \beta_t v_t k_t^\top$ . Adding gives

S_t = \gamma_t S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top ,

which is exactly Gated DeltaNet’s update (Theorem 12.4). Setting $\gamma_t \equiv 1$ (equivalently $w_t = \mathbf 1$ ) recovers plain DeltaNet, $S_t = S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top$ . The two degrees of freedom RWKV-7 adds over the lineage are (i) the per-channel diagonal $w_t$ in place of the scalar $\gamma_t$ — vector-valued forgetting, the strict-contraction lever of §13.2 — and (ii) the learned removal direction $a_t \ne k_t/\|k_t\|$ , which decouples erasure from the write key (§13.3). $\blacksquare$

Solution to Exercise 13.6

(a) By definition $m_t = \max(\log f_t + m_{t-1}, \log i_t)$ , so $m_t \ge \log f_t + m_{t-1}$ and $m_t \ge \log i_t$ . Hence $\log f_t + m_{t-1} - m_t \le 0$ and $\log i_t - m_t \le 0$ , so $f'_t = e^{\log f_t + m_{t-1} - m_t} \in (0, 1]$ and $i'_t = e^{\log i_t - m_t} \in (0, 1]$ . Both are positive (exponentials) and at most $1$ .

(b) Induct on $t$ . At $t = 0$ , $\bar C_0 = C_0 = 0$ and $m_0 = -\infty$ gives $e^{-m_0} = 0$ , consistent with the convention that the first write ( $f'_1 = 0$ ) discards the empty initial state. Assume $C_{t-1} = e^{-m_{t-1}}\bar C_{t-1}$ . Then

C_t = f'_t C_{t-1} + i'_t v_t k_t^\top = e^{\log f_t + m_{t-1} - m_t} e^{-m_{t-1}}\bar C_{t-1} + e^{\log i_t - m_t} v_t k_t^\top = e^{-m_t}\bigl(\bar f_t \bar C_{t-1} + \bar i_t v_t k_t^\top\bigr) = e^{-m_t}\bar C_t ,

using $\bar f_t = e^{\log f_t}$ , $\bar i_t = e^{\log i_t}$ . The identical computation gives $n_t = e^{-m_t}\bar n_t$ .

(c) Substitute $\bar C_t = e^{m_t}C_t$ and $\bar n_t = e^{m_t}n_t$ into the naive readout:

\frac{\bar C_t q_t}{\max(|\bar n_t^\top q_t|, 1)} = \frac{e^{m_t} C_t q_t}{\max(e^{m_t}|n_t^\top q_t|, 1)} = \frac{e^{m_t} C_t q_t}{e^{m_t}\max(|n_t^\top q_t|, e^{-m_t})} = \frac{C_t q_t}{\max(|n_t^\top q_t|, e^{-m_t})} .

The floor must scale to $e^{-m_t}$ : the constant $1$ inside the naive $\max$ lives in unscaled coordinates, and factoring $e^{m_t}$ out of the denominator divides it down to $e^{-m_t}$ . Keeping the floor at $1$ in scaled coordinates would change the readout whenever the normalizer is small — the floor is the one place the change of variables is visible, and getting it right is what makes the stabilizer exact rather than approximate. $\blacksquare$

13.9 Companion code

The companions live in companions/ch13/{jax,torch,julia} and are float64 throughout; the JAX modules are canonical, the torch modules mirror their scoring paths with cross-framework parity pinned $<10^{-9}$ , and the Julia module cross-checks the stabilizer.

jax/generalized_transition.py — the diagonal-plus-rank-one transition, its spectrum (eigvalsh against the scalar-diagonal closed form and the secular equation, Proposition 13.1), the generalized delta-rule recurrence (rank-one lax.scan vs a materialized-transition Python-loop oracle), the exact reduction to Chapter 12’s gated DeltaNet (Proposition 13.2), and the decoupled-eviction demo. Produces transition-spectrum.png and learned-direction.png.
jax/xlstm.py — the mLSTM matrix-memory recurrence, the naive (overflow-prone) and log-domain-stabilized forms, the change-of-variables exactness (Proposition 13.3), the overflow cliff, and the single-pair recovery anchor. Produces stabilizer-overflow.png.
torch/{generalized_transition,xlstm}.py — eager mirrors of the scoring paths, pinned to the JAX outputs in torch/tests/test_ch13_torch.py.
julia/xlstm_stabilization.jl — a stdlib-only cross-language check of the stabilizer: the same P2 exactness, gate boundedness, overflow behavior, and the single-pair recovery of $v = [0.3, -0.7, 1.1]$ at input log-gate $800$ .

PYTHONPATH=. python companions/ch13/jax/generalized_transition.py
PYTHONPATH=. python companions/ch13/jax/xlstm.py
make companion-jax-tests          # all chapters' JAX suites
make companion-torch-tests        # JAX↔torch parity
julia --project=companions/ch13/julia companions/ch13/julia/runtests.jl

Part IV · Integration Week 14 Published

Hybrid architectures and gating mechanisms

Hybrids read through the two-timescale lens — attention as the boundary layer (exact on fast, token-local structure), the decaying state as the slow manifold (compressed carried context), and composition and gate granularity as the design variables; with the production lineup as of May 2026, the MAD methodology, and the two-timescale benchmark seed for pilot B.

Hybrid architectures and gating mechanisms

Chapter 14 — at a glance

Goal: read hybrid architectures — stacks that mix attention layers with SSM/linear-recurrence layers — through the book’s dynamical lens. The classical fast–slow (singular-perturbation) decomposition — Prandtl’s boundary layer, Tikhonov’s theorem — makes the division of labour precise: attention is exact on token-local structure but carries no compressed state; a decaying state tracks slow latent structure at $O(1)$ memory but must be timed correctly. The two design variables the production lineup actually tunes — the layer ratio and the gate granularity — are then budget and routing decisions on top of that division.

Reading time: ~45 minutes prose; ~2 hours with the companions.

Key insight: a hybrid is a budget split between exactness and memory. Sliding-window attention pays $O(w)$ cache for exact pairwise computation inside the window; a gated-decay state pays $O(1)$ for compressed context that must match its forgetting rate to the timescale of what it tracks. The two-timescale task of §14.6 makes both statements measurable: a fixed-decay filter with matched rate is exactly Bayes-optimal (machine-precision identity), and the window a local predictor needs tracks how hard the slow variable is to identify — not how slowly it moves.

Looking ahead: this chapter is pilot-B’s home. The two-timescale task defined in §14.6 is the seed of B’s benchmark; Chapter 16 builds the measurement protocol around it (per-layer probing, the evaluation tiers), and the book-side prerequisites for B close there.

14.1 Two limits of sequence modeling

Part of this book’s argument has been that attention and state-space models are not rival species but two corners of one family. Chapter 9 made that literal: a selective SSM is a masked linear-attention computation (Theorem 9.5). Chapter 11 priced the recurrent corner: a finite additive state can hold only so many associations before interference swamps retrieval (Proposition 11.4), and Chapter 12’s delta-rule lineage bought back capacity with smarter writes at the cost of a genuine stability analysis (Theorem 12.4). The attention corner has the opposite ledger: retrieval is exact at any range it can see, and the price is a cache that grows linearly with what it has seen.

The trade is not hypothetical; it is the explicit organizing axis of the recall-throughput frontier mapped by Based Arora et al. (2024) : architectures that fix state size give up recall, architectures that keep recall give up the $O(1)$ -per-token decode that makes long generation cheap. Pure designs sit at the frontier’s ends. The production systems of §14.5 sit deliberately in the middle, and the question this chapter equips you to ask is why the middle is the right place — not as an engineering compromise but as a statement about the sequences being modeled.

The statement is statistical. Real token streams mix two kinds of structure: fast, token-local dependence — syntax, local copying, the bigram-scale texture a short context pins down — and slow latent structure — topic, style, the document-level regime that changes rarely but conditions everything. (This decomposition-by-capability is also field practice: it is the premise of the synthetic-probe methodology of Poli et al. (2024) and Arora et al. (2024) that §14.6 takes up.) A layer that is exact over a short window serves the first; a compressed state that persists serves the second. Neither serves both: the window forgets the regime the moment it scrolls past the evidence, and the compressed state cannot reproduce exact local interactions it never stored. That intuition — two statistical timescales, two architectural resources — is the chapter’s lens. The next section gives it a theorem in the continuous prototype, and §14.6 an exactly computable discrete counterpart.

14.2 The two-timescale lens

The continuous prototype for “fast and slow living together” is a singularly perturbed linear system: a slow variable $x$ and a fast variable $y$ whose own dynamics relax on a timescale $\varepsilon \ll 1$ ,

\dot{x} \;=\; A x + B y, \qquad \varepsilon\,\dot{y} \;=\; C x + D y ,

with $D$ Hurwitz (every eigenvalue in the open left half-plane — a stable fast subsystem, the timescale separation that defines the stiff regime Chapter 6 tamed with implicit methods; Proposition 6.1). Setting $\varepsilon = 0$ formally gives $y = -D^{-1}C x$ : the slow manifold, the surface on which the fast variable has equilibrated to whatever the slow variable is doing. What happens off the surface is the boundary layer: a transient, of duration $O(\varepsilon)$ , in which $y$ collapses onto the manifold.

Theorem 14.1 (Slow-manifold tracking, linear fast–slow systems).

Let $\dot{x} = Ax + By$ , $\varepsilon\dot{y} = Cx + Dy$ with $D$ Hurwitz, so that $\|e^{Dt}\| \le M e^{-\beta t}$ for some $M \ge 1$ , $\beta > 0$ . Define the deviation from the slow manifold $\eta = y + D^{-1}Cx$ . Then for every fixed horizon $T$ and every bounded initial condition, as $\varepsilon \to 0$ :

Boundary layer. $\|\eta(t)\| \le M e^{-\beta t/\varepsilon}\,\|\eta(0)\| + O(\varepsilon)$ — the fast variable reaches an $O(\varepsilon)$ neighbourhood of the manifold in time $O(\varepsilon\log(1/\varepsilon))$ .
Reduced dynamics. The slow variable tracks the reduced system $\dot{x}_r = (A - BD^{-1}C)\,x_r$ , $x_r(0) = x(0)$ , with $\sup_{t \le T}\|x(t) - x_r(t)\| = O(\varepsilon)\,(1 + \|\eta(0)\|)$ .

The proof is a change of variables and one Grönwall estimate. Differentiating $\eta$ along trajectories,

\varepsilon\,\dot{\eta} \;=\; \varepsilon\,\dot{y} + \varepsilon\,D^{-1}C\,\dot{x} \;=\; D\eta \;+\; \varepsilon\,D^{-1}C\,(Ax + By),

since $Cx + Dy = D(y + D^{-1}Cx) = D\eta$ . The first term contracts at rate $\beta/\varepsilon$ ; the second is a bounded forcing of size $O(\varepsilon)$ on bounded trajectories. Variation of constants gives claim 1. For claim 2, substitute $y = -D^{-1}Cx + \eta$ into the slow equation: $\dot{x} = (A - BD^{-1}C)\,x + B\eta$ , so the difference $e = x - x_r$ obeys $\dot{e} = (A - BD^{-1}C)\,e + B\eta$ with $e(0) = 0$ , and Grönwall bounds $\|e(t)\|$ by a constant times $\int_0^t \|\eta\|$ , which claim 1 bounds by $M\|\eta(0)\|\,\varepsilon/\beta + O(\varepsilon)\,T$ . Exercise 14.4 carries the scalar case through explicitly; the nonlinear generalization is Tikhonov’s theorem Kokotović et al. (1986) .

The theorem licenses a reading of the hybrid division of labour — the reading Chapter 12’s closing section previewed:

Attention is boundary-layer machinery. Within its window it computes pairwise interactions exactly — no compression, no model of how the past decays — which is precisely what fast, token-local structure needs, and exactly what a regime transient needs at the moment the slow context shifts.
A decaying state is slow-manifold machinery. It carries a compressed, $O(1)$ -size summary forward — the manifold parameterization — and its forgetting rate plays the role of the timescale assumption: Chapter 9’s selectivity and Chapter 11’s decay masks (Theorem 11.2) are mechanisms for making that rate fit the data’s slow rate.
A hybrid is a matching construction. Sequential stacks, parallel gated blocks, and interleaved layers (§14.3) are different ways of joining the two descriptions — different matching conditions, in the asymptotics vocabulary.

What this lens does and does not claim. The theorem above is a statement about linear fast–slow ODEs, and in §14.6 an exact discrete counterpart is proved for a filtering task. The architectural reading — attention $\approx$ boundary layer, decaying state $\approx$ slow manifold, hybrid $\approx$ matched expansion — is an interpretive map between idealized computations, not a theorem about trained networks: nothing here claims that a trained hybrid implements a matching condition, and no such result exists. The map earns its keep where it is testable: it predicts which restriction fails on which statistics, and §14.6 measures exactly that. Its third leg is weaker, and we say so: the lens supplies the two specialists, not yet a theory of the joint — it does not by itself predict which matching condition (sequential, parallel, gated) wins where. The one careful cross-vendor comparison available — Lee et al.’s finding that sequential hybrids win at short context and parallel hybrids at long context Lee et al. (2025) — is consistent with components that specialize as the map says, and it is precisely the kind of data a theory of the joint would have to explain. Consistency is evidence for a lens, not proof of a mechanism; we will say “the lens predicts,” never “the network performs asymptotic matching.”

14.3 The composition design space

Fix the two primitives the rest of the chapter mixes, in the exact form the companion implements: sliding-window attention — causal softmax attention in which position $t$ attends to positions $t-w+1, \dots, t$ — and a gated-decay SSM, the per-channel EMA recurrence

h_t \;=\; g_t \odot h_{t-1} \;+\; (1 - g_t) \odot x_t, \qquad g_t \in [0,1]^d,

the diagonal skeleton of Chapter 11’s gated linear attention, kept minimal so that every composition claim below is checkable to machine precision. Three composition patterns cover the production space:

Sequential: one primitive feeds the other within a block, $\mathrm{attn}(\mathrm{ssm}(x))$ or $\mathrm{ssm}(\mathrm{attn}(x))$ — the in-series pairing every stacked hybrid uses locally.
Parallel-gated: both read the same input and a gate blends them, $g \odot \mathrm{attn}(x) + (1-g) \odot \mathrm{ssm}(x)$ — the Hymba Dong et al. (2024) /GMU-style pattern, and the carrier of §14.4’s granularity question.
Interleaved at ratio $r\!:\!1$ : a residual stack in which every $(r{+}1)$ -th block is attention and the rest are SSM — the layer-ratio pattern, the design variable Chapter 12 deferred here from Kimi Linear’s 3:1 choice Kimi Team (2025) .

One boundary needs drawing, because the literature cuts this space twice. Sequential here is a within-block statement (primitives composed directly, no residual between them); a repeating block pattern — Griffin’s two recurrent blocks then one local-attention block, Jamba’s one attention block per eight — is an interleave in this vocabulary, whatever its authors call it. Lee et al.’s sequential-vs-parallel dichotomy Lee et al. (2025) is the coarser cut: their “sequential” covers every design in which information passes through the primitives in series — all interleaves included — against parallel-branch designs. We keep the finer three-way split and map their finding onto it where it is used (§14.5).

The companion pins the algebra that makes these well-posed objects of study (measured printouts; the tests pin each identity below $10^{-12}$ ): the band-masked windowed attention equals its per-position oracle to $4.4 \times 10^{-16}$ ; window $w \ge \seqlen$ is full causal attention (identical outputs, exactly); under constant input $\bar{x}$ the EMA matches its closed form $h_t = (1 - g^t)\bar{x}$ to $8.9 \times 10^{-16}$ ; and the parallel gate’s endpoint reductions are exact — $g = 1$ returns the attention branch bitwise, $g = 0$ the SSM branch.

Why does the ratio — rather than, say, the window alone — carry so much design weight? Because decode-time memory is the binding constraint at production context lengths, and the ratio sets it:

Proposition 14.2 (Decode-state accounting).

A stack of $n$ mixing blocks of width $d$ , of which $n_a$ are sliding-window-attention blocks (window $w$ ) and $n - n_a$ are gated-decay SSM blocks, carries decode-time state

\mathrm{state}(f) \;=\; n_a \cdot 2wd \;+\; (n - n_a)\,d \;=\; n\,d\,\bigl(1 + f\,(2w - 1)\bigr) \quad \text{floats}, \qquad f = n_a/n,

independent of the sequence length $\seqlen$ . A pure full-attention stack carries $n \cdot 2\seqlen d$ , growing linearly in $\seqlen$ . Per block, the hybrid budget is linear in the attention fraction $f$ .

The proof is counting: each attention block holds a rolling $w \times d$ key buffer and a $w \times d$ value buffer; each SSM block holds its $d$ -float state. The companion refuses to leave it at arithmetic — it materializes the buffers and asserts the formula against their actual sizes. At a production-shaped configuration ( $n = 24$ , $d = w = 1024$ ), the measured totals are $12{,}601{,}344$ floats at ratio $3\!:\!1$ and $6{,}312{,}960$ at $7\!:\!1$ , against $3{,}221{,}225{,}472$ for full attention at $\seqlen = 65{,}536$ — factors of $255.6$ and $510.3$ .

Log-scale plot of decode state per block in millions of floats versus attention fraction from 0 to 1. A solid curve rises from about a thousand floats at fraction 0 to about 2 million at fraction 1, far below a dashed full-attention reference line at 134 million. Five production models are marked as dots on the curve between fractions 0.085 and 0.5. — The decode-state budget per mixing block as a function of the attention fraction $f$ ($d = w = 1024$), with the full-attention reference at context length $65{,}536$. Production attention fractions, as of May 2026, cluster at the low end: Nemotron-H at $f = 0.085$ ($751.2\times$ below full attention), Bamba $0.094$ ($679.5\times$), Jamba $0.125$ ($510.3\times$), Kimi Linear $0.25$ ($255.6\times$), Samba $0.5$ ($127.9\times$). Curve and points are computed from the same audited cost function. Produced by companions/ch14/jax/hybrid_block.py; the formula is pinned against materialized buffers in tests/test_hybrid_block.py.

Two structural notes close the composition story. First, interleaving is the pattern that preserves trainability: every layer in the stack remains chunkwise-parallelizable — the SSM blocks by Chapter 11’s recurrent↔parallel duality and the delta-rule blocks by the WY machinery of Theorem 12.5 — so the ratio is a pure budget dial, not a training-throughput sacrifice. Second, the window and the ratio are not interchangeable: $w$ buys reach for the exact path, $f$ buys how often the exact path appears in depth. The two-timescale lens says their jobs differ in kind — $w$ must cover the fast structure’s correlation length, while the slow signal needs some carried state at some depth — and §14.6 measures exactly that asymmetry.

14.4 The gating design space

Composition decides where the two paths sit; gating decides, position by position and channel by channel, how much each path contributes. The production designs of §14.5 differ more in their gates than in anything else, and the differences organize cleanly along two axes.

Granularity — what shape the gate is:

Proposition 14.3 (Gate granularity is an expressivity ordering).

Fix the two branch maps and consider parallel-gated blocks $y = g \odot \mathrm{attn}(x) + (1 - g) \odot \mathrm{ssm}(x)$ . The families

\mathcal{G}_{\mathrm{scalar}} = \{\,g \in [0,1]\,\} \;\subsetneq\; \mathcal{G}_{\mathrm{vector}} = \{\,g \in [0,1]^d\,\} \;\subsetneq\; \mathcal{G}_{\mathrm{input}} = \{\,g = g(x) \in [0,1]^d\,\}

are nested as sets of realizable input–output maps, and each inclusion is strict under its own non-degeneracy condition: the first whenever the branches differ on at least two channels at some input; the second whenever the branches differ at two distinct inputs. Endpoints reduce exactly: $g \equiv 1$ gives the attention branch, $g \equiv 0$ the SSM branch; a $\{0,1\}$ -valued vector gate routes each channel to its branch.

The first strictness is a two-line argument: a $\{0,1\}$ vector gate that sends channel $1$ to attention and channel $2$ to the SSM would force a scalar gate to satisfy $g = 1$ and $g = 0$ simultaneously wherever the branches disagree on both channels. The second — input-dependent gates are genuinely stronger than constant ones — needs its two-input condition (a constant vector gate can already realize any single-input behaviour) and is Exercise 14.6. Both conditions hold generically for the actual attention/EMA branches. The companion pins the reductions and the channel-routing witness exactly (differences of $0.0$ , not merely small). The lineup also uses one coarser point on the same axis: block granularity, where the “gate” is the structural choice of which block type occupies which position — a $\{0,1\}$ routing decision made once, at design time, for a whole block.

The ordering is the cheap part; the design question is where on it to sit. A scalar gate costs one number and one design meeting; a vector gate is Chapter 11’s decay-mask vocabulary (Theorem 11.2) applied to branch mixing; input-conditioned vector gates are exactly the move selective SSMs made in Chapter 9, now steering the blend rather than the state. Production examples, as of May 2026: Nemotron-H mixes with a fixed schedule and no per-channel gate NVIDIA (2025) . Gated DeltaNet’s decay gate is the scalar $\gamma_t$ per head Yang et al. (2025) — input-dependent but coarse — and Kimi’s KDA is precisely the per-channel (vector) refinement of it Kimi Team (2025) , the granularity class Griffin’s RG-LRU also occupies De et al. (2024) . SambaY’s Gated Memory Unit is an element-wise (vector) gate whose value is computed from a shared memory stream — input-dependent — while its placement across decoder layers is a fixed structural choice Ren et al. (2025) ; the trigger axis below classifies the value, not the wiring.

Trigger — what the gate responds to. A gate can be set by position (a fixed schedule: every $(r{+}1)$ -th block, Hunyuan TurboS’s macro-block patterns Tencent Hunyuan Team (2025) ), by the input (every selective/gated recurrence above), or — the 2026 frontier — by the model’s own output state: AMOR invokes attention only at positions where prediction entropy crosses a threshold, reporting attention on roughly 22% of tokens with long-context stability Zheng & Shani (2026) . The trigger axis has a sharp theoretical caution attached: Basu’s analysis of content-based routing argues that mechanisms which avoid pairwise token comparison cannot route precisely (1–29% routing accuracy across recurrent, memory-bank, and bandit-style gates) — high-precision routing needs pairwise machinery, the representational ingredient cheap gates omit, though not necessarily full attention: the same paper constructs a 99.7%-accurate router from a rank-one pairwise projection at linear cost Basu (2026) .

Categorical scatter plot with gate granularity on the horizontal axis (scalar, vector, block) and gate trigger on the vertical axis (fixed schedule, input-dependent, output-triggered). Nine architectures are placed as labeled dots: Nemotron-H and Jamba and Hunyuan TurboS along the fixed-schedule row, Gated DeltaNet and SambaY and Griffin and Kimi Linear and Hymba on the input-dependent row, AMOR alone on the output-triggered row. — The gating design space, as of May 2026: granularity (scalar / vector / block) against trigger (fixed schedule / input-dependent / output-triggered), with nine named designs placed per their papers' descriptions (the trigger classifies the gate's value, not its wiring). The lower rows are crowded and the top row is nearly empty — output-triggered gating (AMOR) is the frontier, and Basu's routing bound says filling it honestly requires pairwise machinery. Schematic placement (no measured quantity); produced by companions/ch14/jax/hybrid_block.py.

Two families that belong to this design space by shape are treated elsewhere: xLSTM’s exponential gates (granularity: vector; trigger: input; plus a stabilizer state the sigmoid families do not need) and the test-time memory writes of the Titans line (trigger: the loss, a fourth trigger class) are Chapter 13’s subjects, where their matrix-memory context lives.

14.5 The production lineup, as of May 2026

The lens and the design space were built to read a real lineup, so here it is.

Mix pattern uses §14.3’s vocabulary; gate class uses §14.4’s.

| Model | Mix pattern (attention fraction $f$ ) | Gating | Reported headline (paper’s own) | |---|---|---|---| | Griffin De et al. (2024) | interleaved, 2 RG-LRU : 1 local attention ( $f = 1/3$ ) | vector, input-dep. | matches Llama-2 quality with faster long-sequence inference | | Samba Ren et al. (2024) | interleaved 1:1 Mamba + sliding window ( $f = 1/2$ ) | vector (Mamba’s) | linear-time training at 4K, length extrapolation to 256K context | | Jamba Lieber et al. (2024) | interleaved 1:7 attention:Mamba, MoE FFNs ( $f = 1/8$ ) | block-structural | 256K context at 12B-active scale | | Bamba (IBM, HF blog) Bamba Team (IBM, Princeton, CMU, UIUC) (2024) | interleaved 3:29 attention:Mamba-2 ( $f \approx 0.09$ ) | block-structural | fully open data recipe; ~2.5× vLLM decode throughput vs comparable transformers | | Nemotron-H NVIDIA (2025) | interleaved, 10 attention of 118 layers ( $f \approx 0.08$ ) | scalar schedule | ~3× faster inference than similarly sized transformers at comparable accuracy | | SambaY / Phi-4-mini-flash Ren et al. (2025) | Samba self-decoder + GMU cross-decoder, single full-attention layer | vector (GMU) | ~10× decoding throughput on long-generation reasoning | | Hunyuan TurboS Tencent Hunyuan Team (2025) | AMF/MF macro-blocks (attention–Mamba–FFN), MoE | block schedule | top-10 LMSYS arena placement at 56B activated | | Kimi Linear Kimi Team (2025) | interleaved 3:1 KDA:full attention (MLA) ( $f = 1/4$ ) | vector, input-dep. | 75% KV-cache reduction, up to ~6× decode at 1M context |

Read the table against Proposition 14.2 and the figure that draws it: every row sits low on the attention-fraction curve, exactly where the budget argument says the cheap exactness should be spent. The within-family motion is also informative: Samba → SambaY replaces cross-attention with a gate (GMU) and gains its throughput from the gating design space, not from a new mixing pattern; Jamba → Bamba → Nemotron-H is a march down the attention fraction as the slow-path layers improved (Mamba-1 → Mamba-2 Dao & Gu (2024) ); Kimi reaches the same neighbourhood from the delta-rule side, with Chapter 12’s gated-delta machinery as the slow path.

One cross-vendor regularity deserves its own sentence. Lee et al., comparing sequential against parallel mixing across vendors and scales, find sequential hybrids stronger at short context and parallel hybrids stronger at long context Lee et al. (2025) — the single most design-relevant empirical fact in the lineup. (Their dichotomy is the coarser cut §14.3 reconciled: “sequential” there spans the interleaves above.) Per §14.2’s flag, it is a consistency check for the matching lens — the kind of regularity a theory of the joint would have to explain — not a verification of it.

14.6 MAD and the two-timescale benchmark seed

How should claims like the table’s be tested? Full-scale training runs answer with averages over everything at once. The mechanistic-architecture- design (MAD) methodology Poli et al. (2024) answers with small, synthetic, capability-isolating tasks — in-context recall, compression, copying — chosen so that performance on the probe predicts scaling behaviour of the full architecture; the MQAR family Arora et al. (2024) did the same for associative recall specifically, and Chapter 11 already used it to measure the capacity wall. This section contributes the probe the two-timescale lens calls for, and it is pilot B’s seed.

The task. A hidden Markov model with the two timescales built in:

a slow regime $z_t \in \{1, \dots, K\}$ with sticky transition $T(\varepsilon) = (1-\varepsilon)\,I + \tfrac{\varepsilon}{K-1}(J - I)$ — stay with probability $1 - \varepsilon$ , else jump uniformly ( $J$ the all-ones matrix);
a fast process: given the regime, tokens follow a regime-specific bigram table, $P(x_{t+1} \mid z_{t+1} = j, x_t) = B_j[x_t, \cdot]$ on a vocabulary of size $V$ ;
an overlap dial $\eta \in (0,1]$ : the tables interpolate between a shared table and regime-specific ones, $B_j = (1-\eta)C + \eta D_j$ , controlling how hard the regimes are to tell apart locally — measured by the mean per-token discrimination $\bar{\imath}$ (the average pairwise KL between regime rows).

Because the model is known, every predictor below is exact — no training, no approximation other than the stated information restriction — so every claim is a computation, checkable to machine precision. The Bayes-optimal predictor is the HMM forward filter, and the companion validates it against a brute-force enumeration over every regime path — $K^{t+1}$ paths for the prediction at position $t$ , an independent code path that is only feasible on short validation instances ( $\seqlen = 7$ ): maximum prediction difference $1.1 \times 10^{-16}$ .

Three restrictions mirror the architectural resources:

window- $w$ (the attention idealization): the same Bayes computation, restricted to the last $w$ tokens with a uniform regime prior at the window edge — exact pairwise machinery, zero carried state. With $w \ge \seqlen$ it reproduces the full filter exactly (difference $0.0$ ).
fixed-decay (the state idealization): the filter run with the mixing-to-uniform prior update $M(\lambda) = (1-\lambda)I + \tfrac{\lambda}{K}J$ — a compressed state forgotten toward uniform at a fixed rate, the EMA of §14.3 wearing filtering clothes.
unigram (the slow-manifold-only idealization): regime tracking with each bigram row replaced by the regime’s stationary unigram — carried state, no fast structure.

The fixed-decay restriction is the one with a theorem in it:

Theorem 14.4 (The matched decay is exactly Bayes-optimal).

For every $K \ge 2$ and $\varepsilon \le (K-1)/K$ ,

T(\varepsilon) \;=\; M(\lambda^\star) \qquad \text{at} \qquad \lambda^\star \;=\; \frac{\varepsilon K}{K - 1},

entrywise. Consequently the fixed-decay filter with rate $\lambda^\star$ is the exact Bayes filter for the two-timescale task: a fixed-decay state model is optimal when its forgetting rate matches the regime’s switching rate, and every other fixed $\lambda$ is the Bayes filter for a misspecified switching rate — strictly costly whenever the regimes are distinguishable at all (the measured V-shape below; in the degenerate identical-tables limit every $\lambda$ ties).

The proof is two lines of algebra on the diagonal and off-diagonal entries (Exercise 14.5). The companion measures the identity at $1.1 \times 10^{-16}$ entrywise and the resulting prediction agreement between the matched-decay filter and the optimal filter at $3.3 \times 10^{-16}$ . This is the chapter’s sharpest statement of the slow-manifold half of the lens: gating is timescale matching. Chapter 9’s selectivity and Chapter 11’s decay masks let a model learn $\lambda$ ; the theorem says what the target of that learning is, and the V-shaped mistiming cost below says what missing it costs.

What each restriction costs, measured. At the committed reference configuration ( $K = 4$ , $V = 12$ , $\eta = 0.4$ , $\varepsilon = 0.02$ , $\seqlen = 8192$ , burn-in $128$ , $\bar{\imath} = 0.534$ nats/token), the optimal filter achieves $1.9289$ nats. The window-8 predictor pays an excess of $0.0246$ nats; a fixed decay matched to the wrong rate ( $\lambda^\star(0.2)$ where the truth is $\varepsilon = 0.02$ ) pays $0.0512$ ; the unigram predictor pays $0.4965$ — twenty times the window’s tax. Dropping the fast structure is catastrophic; carrying the wrong-decay state or a modest window is merely costly.

Two stacked panels sharing a log-scale switch-probability axis. Top panel: excess cross-entropy of the window-8 predictor falls from 0.058 nats at the slowest regimes toward zero at fast regimes; window-64 sits at zero throughout; the fixed-decay curve is V-shaped, touching zero at the matched switch rate 0.05 and rising on both sides. Bottom panel, on a ten-times larger scale: the unigram predictor's excess stays between 0.41 and 0.55 nats across all switch rates. — Excess cross-entropy over the exact filter as the regime speed $\varepsilon$ varies (overlap $\eta = 0.4$, $\seqlen = 8192$). Top: the window-8 predictor's tax is largest for the slowest regimes — carried state is worth most exactly when the latent moves slowest — while window-64 is indistinguishable from optimal here, and the fixed-decay filter (rate matched at $\varepsilon = 0.05$) shows the V-shaped mistiming cost that vanishes only at its matched point, the measured face of the matched-decay theorem. Bottom (note the scale): discarding fast structure costs ~0.5 nats regardless of regime speed. Differences below about $2 \times 10^{-4}$ nats are at the single-sequence Monte-Carlo floor. Produced by companions/ch14/jax/two_timescale.py; the reference-configuration values are pinned in tests/test_two_timescale.py.

The third timescale, and the benchmark design lesson. The window curves above collapse to zero by $w = 64$ . Why so small? Because with $\eta = 0.4$ the regimes are still locally distinguishable enough that a few dozen tokens identify the current regime — the identification timescale $\tau_{\mathrm{id}}$ is short. The overlap dial exists to control exactly this, and the crossover figure sweeps it:

Excess cross-entropy versus window size on a log-2 axis from 2 to 256, one curve per overlap value. The sharp-regime curve starts highest at window 2 but plunges to zero by window 8. The intermediate-overlap curve crosses it, decaying to zero near window 32. The low-overlap curve starts lowest but decays slowest, still positive at window 64. — The window a local predictor needs tracks identification difficulty, not regime speed ($\varepsilon = 0.005$ fixed). With sharply distinct regimes ($\eta = 1.0$, $\bar{\imath} = 3.04$ nats/token) a window of 8 is already near-optimal; at $\eta = 0.4$ ($\bar{\imath} = 0.53$) the tax persists to $w \approx 32$; at $\eta = 0.2$ ($\bar{\imath} = 0.21$) it persists past $w = 64$. The curves cross: harder identification lowers the value of regime knowledge but stretches how long a window must be to acquire it. Produced by companions/ch14/jax/two_timescale.py; the legend's discrimination values and the $\eta$-monotonicity are pinned in tests/test_two_timescale.py.

The design lesson is the section’s payload, and it is a constraint on benchmarks rather than on architectures: a two-timescale task separates fast-path from slow-path machinery only when the three timescales are ordered $w \ll \tau_{\mathrm{id}} \ll 1/\varepsilon$ . Make the regimes too distinguishable ( $\eta \to 1$ ) and a sliding window solves everything — the benchmark measures nothing about carried state. Make them too similar and even the optimal filter gains little — there is nothing to measure at all. Pilot B’s benchmark must therefore report $\bar{\imath}$ (or place $\tau_{\mathrm{id}}$ directly) as a task parameter alongside $\varepsilon$ , exactly as this companion does. One measurement is deliberately deferred with the protocol: the composite restriction — a window- $w$ filter seeded at its edge with a decayed carried prior instead of the uniform one — is the idealized matched expansion itself, the natural probe of §14.2’s third leg, and it belongs with B’s protocol work rather than this chapter’s seed. What else does not transfer to Chapter 16 from here: the per-layer probing protocol that asks where a trained hybrid stores the regime posterior, the evaluation tiers around real benchmarks, and B’s five-axis decomposition — that is Chapter 16’s subject.

14.7 What’s next

This chapter mixed the layer families of Chapters 9–12 into hybrids and read the mixture through a fast–slow lens with one continuous theorem (Theorem 14.1) and one discrete one (Theorem 14.4) holding the interpretation to account. Its sibling beyond-SSM chapter, Chapter 13 (already behind you if reading in order), lives inside the gates themselves: exponential gating with its stabilizer states (xLSTM) and the generalized delta rule (RWKV-7) extend Chapter 12’s lineage to matrix memories with their own stability questions. Chapter 15 plays prosecution — the counter-evidence file on what SSM-heavy designs provably cannot do, and the diagnostic toolkit for detecting it. Chapter 16 builds the evaluation methodology this chapter’s benchmark seed feeds into — the protocol around the task, the per-layer probes, and the point where pilot B’s book-side prerequisites close.

14.8 Exercises

Three short problems (solutions inline) and three longer ones (solutions in §14.9).

Exercise 14.1 (short)

A 32-block hybrid has width $d = 2048$ , window $w = 2048$ , and ratio $7\!:\!1$ (so $n_a = 4$ ). Using Proposition 14.2, compute the decode state in floats, the same stack’s full-attention cache at $\seqlen = 131{,}072$ , and the ratio. At what context length $\seqlen^\star$ does full attention’s cache equal the hybrid’s?

Solution

State $= 4 \cdot 2 \cdot 2048 \cdot 2048 + 28 \cdot 2048 = 33{,}611{,}776$ floats. Full attention: $32 \cdot 2 \cdot 131{,}072 \cdot 2048 = 17{,}179{,}869{,}184$ — a factor of $511.1$ . Setting $32 \cdot 2\seqlen d = 33{,}611{,}776$ gives $\seqlen^\star = 256.4$ : past a few hundred tokens of context, this hybrid already stores less than full attention ever would.

Exercise 14.2 (short, code)

Run companions/ch14/jax/two_timescale.py. Report the reference-configuration excesses (window-8, mistimed decay, unigram), then change _FIG_OVERLAP to $1.0$ and to $0.2$ and report how the window-8 excess at $\varepsilon = 0.005$ moves. One sentence: why does the hardest-to-identify task show the smallest window-8 excess?

Solution

Reference: $0.0246$ (window-8), $0.0512$ (decay mistimed by a factor 10), $0.4965$ (unigram). From the script’s printed crossover sweep, the window-8 excess at $\varepsilon = 0.005$ is $0.0038$ at $\eta = 1.0$ , $0.0496$ at $\eta = 0.4$ , $0.0240$ at $\eta = 0.2$ . Identification difficulty cuts both ways: at $\eta = 0.2$ the regimes are hard to identify in a window, but knowing the regime is also worth less (the tables nearly coincide), so the achievable excess shrinks even as the window needed to remove it grows.

Exercise 14.3 (short)

Let $a = \mathrm{attn}(x)$ and $s = \mathrm{ssm}(x)$ be the two branch outputs at one position, and suppose $a \ne s$ in every channel. Show that the set of outputs reachable by a scalar gate is a line segment in $\mathbb{R}^d$ , that the vector-gate reachable set is the full axis-aligned box with corners $a$ and $s$ , and conclude the dimension counts ( $1$ versus $d$ ) behind Proposition 14.3.

Solution

Scalar: $\{g a + (1-g) s : g \in [0,1]\}$ is the segment from $s$ to $a$ . Vector: channel $j$ of the output is $g_j a_j + (1-g_j) s_j$ , which ranges over the interval with endpoints $s_j, a_j$ independently of other channels — the product of $d$ nondegenerate intervals, i.e. the box. A segment is one-dimensional; the box is $d$ -dimensional, so for $d \ge 2$ the inclusion is strict — and the $\{0,1\}$ corners of the box are exactly the channel-routing witnesses the companion pins.

Exercise 14.4 (theory) — solution in §14.9

Prove Theorem 14.1 in the scalar case $\dot{x} = ax + by$ , $\varepsilon\dot{y} = cx + dy$ with $d < 0$ : solve the $\eta = y + (c/d)x$ dynamics exactly, exhibit the boundary layer, and bound $|x - x_r|$ on $[0, T]$ explicitly, identifying where each hypothesis ( $d < 0$ , fixed $T$ , bounded initial data) is used.

Exercise 14.5 (theory) — solution in §14.9

Prove Theorem 14.4, then compute the spectrum of $T(\varepsilon)$ and of $M(\lambda)$ , the stationary distribution of each, and the mixing time of the regime chain. Conclude that for large $K$ the matched rate $\lambda^\star \approx \varepsilon$ : forgetting toward uniform at the switching rate.

Exercise 14.6 (theory) — solution in §14.9

Complete the second strictness in Proposition 14.3: construct two inputs $x^{(1)}, x^{(2)}$ and a target map realizable by an input-dependent vector gate but by no constant vector gate. Then explain why a gate computed from another layer’s state (the GMU pattern) sits in the input-dependent class.

14.9 Full solutions to theory exercises

Solution to Exercise 14.4

With $\eta = y + (c/d)x$ the fast equation becomes, exactly as in the matrix case,

\varepsilon\dot{\eta} \;=\; d\,\eta \;+\; \varepsilon\,\frac{c}{d}\,(ax + by),

a scalar linear ODE with constant contraction rate $|d|/\varepsilon$ (here $d < 0$ is used — without it $\eta$ grows and there is no layer). Variation of constants:

\eta(t) \;=\; e^{dt/\varepsilon}\eta(0) \;+\; \frac{c}{d}\int_0^t e^{d(t-s)/\varepsilon}\,(ax + by)(s)\,ds .

On $[0, T]$ the trajectory $(x, y)$ is bounded uniformly in $\varepsilon$ , say by $R$ — and this is where $d < 0$ earns its keep a second time. Fixed horizon and bounded data alone give only an $\varepsilon$ -dependent bound (the system matrix has $O(1/\varepsilon)$ entries); with $d < 0$ , the contraction keeps $|\eta(t)| \le |\eta(0)| + O(\varepsilon R)$ , hence $|y| \le |c/d|\,|x| + |\eta(0)| + O(\varepsilon R)$ , the slow equation then bounds $|x|$ by an $\varepsilon$ -independent Grönwall constant, and a standard bootstrap (pick $R$ dominating both estimates; continuity keeps the trajectory inside) closes the loop. Without $d < 0$ the fast variable blows up as $\varepsilon \to 0$ and no uniform $R$ exists. With $R$ in hand, $|(ax + by)(s)| \le (|a| + |b|)R$ , so the integral term is at most $\tfrac{|c|}{|d|}(|a| + |b|) R \cdot \tfrac{\varepsilon}{|d|}(1 - e^{dt/\varepsilon}) \le \tfrac{|c|(|a| + |b|) R}{d^2}\,\varepsilon$ : the boundary-layer estimate $|\eta(t)| \le e^{-|d| t/\varepsilon}|\eta(0)| + O(\varepsilon)$ , with the layer over by $t \sim \varepsilon\log(1/\varepsilon)$ . Substituting $y = -(c/d)x + \eta$ into the slow equation gives $\dot{x} = (a - bc/d)x + b\eta$ , so $e = x - x_r$ obeys $\dot{e} = (a - bc/d)e + b\eta$ , $e(0) = 0$ , and

|e(t)| \;\le\; |b|\,e^{|a - bc/d|\,T} \int_0^T |\eta(s)|\,ds \;\le\; |b|\,e^{|a - bc/d|\,T}\Bigl(\frac{\varepsilon}{|d|}|\eta(0)| + \frac{|c|(|a| + |b|) R}{d^2}\,\varepsilon\,T\Bigr) \;=\; O(\varepsilon),

where the fixed horizon $T$ is used to keep the Grönwall factor a constant. Each piece of the theorem’s statement appears with explicit constants. $\blacksquare$

Solution to Exercise 14.5

The identity. Diagonal entries: $T(\varepsilon)_{ii} = 1 - \varepsilon$ and $M(\lambda)_{ii} = 1 - \lambda + \lambda/K$ ; equality forces $\lambda(1 - 1/K) = \varepsilon$ , i.e. $\lambda^\star = \varepsilon K/(K-1)$ . Off-diagonal: $T_{ij} = \varepsilon/(K-1)$ and $M_{ij} = \lambda/K$ , and at $\lambda^\star$ indeed $\lambda^\star/K = \varepsilon/(K-1)$ . Both matrices are row-stochastic, so matching all entries matches the matrices; the constraint $\lambda^\star \le 1$ is $\varepsilon \le (K-1)/K$ . Since the Bayes filter’s only use of the regime chain is the prior update $p \mapsto p\,T(\varepsilon)$ , replacing $T$ by $M(\lambda^\star)$ changes nothing at any step: the fixed-decay filter at the matched rate is the Bayes filter, identically. $\blacksquare$

Spectra and mixing. Both matrices are of the form $\alpha I + \beta J$ : eigenvalues $\alpha + K\beta$ on the all-ones vector and $\alpha$ on its $(K-1)$ -dimensional complement. For $T(\varepsilon)$ : eigenvalue $1$ (stochasticity) and $1 - \varepsilon - \varepsilon/(K-1) = 1 - \lambda^\star$ with multiplicity $K - 1$ ; for $M(\lambda)$ : $1$ and $1 - \lambda$ . The stationary distribution of each is uniform (both are doubly stochastic). The regime chain forgets its initial condition at geometric rate $1 - \lambda^\star$ per step — mixing time $\sim 1/\lambda^\star$ — which is the precise sense in which $1/\varepsilon$ is the slow timescale. As $K \to \infty$ , $\lambda^\star = \varepsilon K/(K-1) \to \varepsilon$ : forget toward uniform at exactly the switching rate. $\blacksquare$

Solution to Exercise 14.6

Take $d = 1$ (one channel suffices) and any two inputs with $a^{(1)} := \mathrm{attn}(x^{(1)}) \ne \mathrm{ssm}(x^{(1)}) =: s^{(1)}$ and likewise $a^{(2)} \ne s^{(2)}$ . Define the target map to output the attention branch on $x^{(1)}$ and the SSM branch on $x^{(2)}$ . A constant gate $g \in [0,1]$ must satisfy $g a^{(1)} + (1-g) s^{(1)} = a^{(1)}$ , forcing $g = 1$ , and $g a^{(2)} + (1-g) s^{(2)} = s^{(2)}$ , forcing $g = 0$ — contradiction. An input-dependent gate with $g(x^{(1)}) = 1$ , $g(x^{(2)}) = 0$ realizes the target, so the inclusion is strict. (With $d \ge 2$ the same argument runs per channel.)

A GMU-style gate is computed from the hidden state of another layer — a deterministic function of the network input — so as an input–output map it is exactly an input-dependent vector gate: $g = g(x)$ , with the cross-layer sharing changing where the gate’s information comes from (a memory stream computed once and reused) but not the function class. That is why SambaY’s throughput gain is a systems gain — fewer recomputations of expensive context — while its expressivity class is the one already characterized here. $\blacksquare$

14.10 Companion code

The companions live in companions/ch14/{jax,torch} and are float64 throughout. The JAX modules are canonical and produce all four figures; the torch modules mirror the hybrid block and the filter family with cross-framework parity pinned to $< 10^{-9}$ (tokens drawn once and fed to both implementations). No Julia module this chapter: the numerical core is filtering linear algebra, not the integrator analysis that earned Julia its seat in Chapters 10 and 12.

# JAX (canonical; emits the four figures to public/figures/ch14/)
PYTHONPATH=. python companions/ch14/jax/hybrid_block.py
PYTHONPATH=. python companions/ch14/jax/two_timescale.py

# Tests: JAX identities (< 1e-12), torch parity (< 1e-9)
.venv/bin/pytest companions/ch14 -q

hybrid_block.py — sliding-window attention with its per-position oracle, the gated-decay EMA and its closed form, the three composition patterns with exact gate reductions, the $r\!:\!1$ schedule, and the decode-cost accounting audited against materialized buffers; produces the cost-frontier and design-space figures (§§14.3–14.5).
two_timescale.py — the two-timescale HMM with the overlap dial, the exact forward filter validated against brute-force path enumeration, the window / fixed-decay / unigram restrictions, the matched-decay identity at machine precision, and the discrimination diagnostic $\bar{\imath}$ ; produces the two-timescale-error and window-crossover figures (§§14.2, 14.6). This module is pilot B’s seed artifact.
torch/ — eager mirrors of both modules plus the buffers-vs-Parameters distinction in a minimal gated-mix layer (the learned blend is a Parameter; the fixed decay rates are a buffer).

Part IV · Integration Week 15 Published

Counter-evidence and diagnostic tools: where SSMs fail

The prosecution's file — what fixed-state and SSM-heavy designs provably cannot do (the TC⁰ ceiling, the illusion of state, the copying separation) and the diagnostic toolkit that measures stability and capacity on real systems, from an architecture-agnostic information-counting bound through the Benettin–QR Lyapunov estimator and its resolution limit to effective state size as a regime diagnostic, all validated against known ground truth.

Counter-evidence and diagnostic tools: where SSMs fail

Chapter 15 — at a glance

Goal: assemble the prosecution’s file. The book has spent fourteen chapters building the case for state-space and linear-recurrent models; this chapter is the counter-evidence — what fixed-state designs provably cannot do, and the diagnostic toolkit that measures, on a concrete system, how close it sits to those limits. Two threads: the impossibility results (the $\mathsf{TC}^0$ ceiling, the illusion of state, the copying separation) that we cite but do not re-prove, and the diagnostics (a capacity bound, Lyapunov exponents, effective state size) that we build and validate against known ground truth.

Reading time: ~45 minutes prose; ~2 hours with the companions.

Key insight: an architecture’s promise is an asymptotic ceiling; a diagnostic is what tells you where a concrete system sits beneath it. The deep impossibility theorems say what no fixed-state recurrence can do as length grows; the one bound we can prove ourselves — a pigeonhole count — is their honest, weaker shadow. The diagnostics then turn the book’s stability thread into measurements: the Lyapunov spectrum of a trained-like recurrence, and an effective state size that counts its memory modes.

Looking ahead: Chapter 15 is not a pilot anchor (C1 closed at Chapter 10, B at Chapter 16), but it builds the instruments pilot B uses. Every diagnostic here runs on constructed systems with known spectra — applying them to actually trained networks is B’s empirical program, and weighing the prosecution’s evidence against the defense is Chapter 17’s synthesis. The diagnostics are the tools; the verdict is deferred.

15.1 The prosecution’s file

Three earlier chapters filed their counter-evidence here. Chapter 13 closed by naming this chapter “the counter-evidence file: it takes the stability questions raised here — when does a matrix memory’s recurrence stay bounded, and what can it provably not do — and turns them into diagnostics and impossibility results.” Chapter 14 cast it as the prosecution to its own defense of hybrids. Chapter 16 left “the impossibility side of the story — what else fixed state provably cannot do” explicitly to “Chapter 15’s file.” This chapter discharges all three.

The lens the book has used — a sequence layer is a dynamical system, read through its state — obliges two kinds of counter-evidence, and they are different in character:

Impossibility. Some tasks are out of reach not for want of tuning but for structural reasons: a fixed-size state is an information bottleneck, and a parallelizable recurrence sits in a circuit-complexity class that excludes genuinely sequential computation. These are theorems about every model of a given shape, and the strongest of them — the $\mathsf{TC}^0$ ceiling, the copying separation — are deep results we cite and attribute, never re-derive (§15.2). The one impossibility we can prove in a page is a counting bound (§15.3); it is deliberately weaker than the cited theorems, and saying exactly how it is weaker is half its value.
Diagnosis. Impossibility says what no such model can do; it does not tell you where this trained model sits. For that you need instruments that measure a concrete recurrence: its stability (does the state stay bounded? — Lyapunov exponents, §15.4) and its usable memory (how many modes actually persist? — effective state size, §15.5). These are the numerical-analyst’s tools, and the discipline of the chapter is to validate each against a system whose answer we know by construction before trusting it on a system we don’t.

The boundary between the two threads is the boundary between cited and proven, and the chapter marks it everywhere: a result in a shaded box with an attribution is someone else’s theorem; a result in a numbered proposition with a proof (or an exercise pointing to one) is ours, and every number it claims is measured by a companion.

15.2 The expressivity ceiling: TC⁰ and the illusion of state

The deepest counter-evidence is complexity-theoretic, and it cuts both ways. A transformer with logarithmic-precision arithmetic and a state-space model are, to first order, in the same computational class — and that class is small.

Cited results: the expressivity ceiling (stated, not proven here)

These are theorems of others, quoted without proof; our companions measure only their self-contained shadow (§15.3).

(Log-precision transformers lie in uniform $\mathsf{TC}^0$ .) Of Merrill & Sabharwal Merrill & Sabharwal (2023) : a transformer with $O(\log n)$ -bit precision on length- $n$ inputs can be simulated by constant-depth, polynomial-size threshold circuits. Under standard assumptions ( $\mathsf{TC}^0 \ne \mathsf{NC}^1$ ) this excludes inherently sequential problems — evaluating arbitrary Boolean formulas, solving linear systems, composing permutations.
(State-space models share the ceiling — the illusion of state.) Of Merrill, Petty & Sabharwal Merrill et al. (2024) : despite carrying a “state,” linear/SSM recurrences computable in parallel also fall in $\mathsf{TC}^0$ , and provably cannot express state-tracking that $\mathsf{TC}^0$ excludes — composing permutations from the symmetric group $S_5$ , tracking chess moves, evaluating code. The recurrent state is real, but its expressive reach is no larger than the non-recurrent model’s.

The phrase “illusion of state” is precise: a recurrence that the associative scan of Chapters 8–9 parallelizes is, by that very parallelizability, confined to the shallow-circuit class. The state buys linear-time inference and a place to put memory; it does not buy sequential expressivity. This is the formal counter-weight to a claim Chapter 13 flagged and deferred: RWKV-7’s paper reports that it “recognizes all regular languages,” apparently reaching above the ceiling. The reconciliation is in the fine print — that result holds for a fixed network with precision and depth allowed to grow with the automaton, not within the uniform, fixed-precision regime the impossibility theorems assume.

The copying separation makes the abstract ceiling concrete. Of Jelassi et al. Jelassi et al. (2024) : a two-layer transformer copies strings of length exponential in its size, while a fixed-state recurrence is bounded by what its state can hold — “structural, not a tuning issue.” That structural bound is the one piece of the impossibility story we can prove ourselves, and §15.3 does.

15.3 The capacity bound: a counting argument

Chapter 11 proved a rank wall for linear attention (Proposition 11.4) and Chapter 16 measured a recall cliff for an exact-capacity slot model (Proposition 16.1). Both are instances of one architecture-agnostic fact: a finite state is a finite communication channel, and no amount of cleverness lets a channel carry more than its capacity.

Proposition 15.1 (Information-counting bound on lossless recall).

Let a deterministic sequence model process inputs by a recurrence carrying a hidden state $s_t$ in a set $\mathcal S$ representable in $d$ coordinates at $b$ bits each, so $|\mathcal S| \le 2^{db}$ . Suppose a task requires the model, after reading a prefix, to answer queries whose correct answers differ across a family $\mathcal X$ of prefixes that must be distinguished. If $|\mathcal X| > 2^{db}$ , then two distinct prefixes induce the same post-prefix state, and the model answers them identically — wrong on at least one. In particular, losslessly recalling length- $n$ sequences over an alphabet $\Sigma$ requires

d\,b \;\ge\; n \log_2 |\Sigma| \qquad\Longleftrightarrow\qquad n \;\le\; \frac{d\,b}{\log_2 |\Sigma|} .

The proof is pure pigeonhole (Exercise 15.4): the post-prefix state is a deterministic function $\sigma : \mathcal X \to \mathcal S$ , every later output factors through $\sigma$ , and $|\mathcal X| > |\mathcal S|$ forces a collision. Specializing to verbatim copy, the $|\Sigma|^n$ length- $n$ strings must all be distinguished, giving $2^{db} \ge |\Sigma|^n$ .

Two things make this honest rather than impressive. First, it is weaker than the cited theorems: it bounds lossless recall by counting, where Jelassi et al. exhibit the separation empirically and Merrill et al. place it in a complexity hierarchy. A real model also fails below this bound — interference, finite training, and lossy encodings mean the usable capacity is a fraction of $d\,b$ . The counting bound is the ceiling of the ceiling. Second, it is architecture-agnostic: Chapter 11’s rank- $d_k$ state ( $d = d_k$ , with $b$ the bits of a stored coordinate) and Chapter 16’s $d$ -slot store are the two instances where the abstract $\mathcal S$ becomes concrete, and the chapter derived each separately. Here they are one count.

The companion realizes the bound on Chapter 16’s slot model, where each of $d$ slots holds one token-pair, i.e. $b = \log_2(\text{vocab})$ bits. The abstract budget $d\,b$ then meets the lossless requirement $n\log_2(\text{vocab})$ exactly at $n^\star = d$ , and the measured recall cliff sits there.

Two panels. Left: recall accuracy versus number of stored pairs N on a log-2 axis, for a d = 16 slot model; the exact curve min(1, d/N) is flat at 1 up to N = 16 then falls as d/N, with measured slot-reader points lying exactly on it and a dashed vertical line at the capacity threshold N = 16. Right: bits versus sequence length n; a rising blue line for the required n log2 of the alphabet size crosses a flat red line for the state budget d times b equal to 96 bits at n = 16, marked with an orange dot. — The information-counting bound (§15.3) on Chapter 16's slot model. Left: a $d = 16$ store recalls perfectly up to load $N = d$ and degrades as $\min(1, d/N)$ beyond; the measured slot reader matches the exact closed form to $0$ at every load (8 seeds), and the cliff sits exactly at the capacity threshold $n^\star = d = 16$. Right: with $b = \log_2(\text{vocab}) = 6$ bits per slot the state budget $d\,b = 96$ bits meets the lossless requirement $n\log_2|\Sigma|$ at $n^\star = 16$ — the same load. Produced by companions/ch15/jax/copying_bound.py (which imports Chapter 16's slot model); pinned in tests/test_copying_bound.py.

15.4 Lyapunov diagnostics: measuring stability on a concrete system

Impossibility is about all models; the rest of the chapter is about this one. The first instrument measures stability. Chapter 2 defined a system’s Lyapunov exponents as the asymptotic log-growth rates of its state (Definition 2.2, Theorem 2.1) and computed them with the Benettin QR algorithm Benettin et al. (1980) ; we reuse that engine unchanged and ask what it can and cannot tell us about a recurrence whose transition we did not choose — a trained-like SSM layer, the matrix memory of Chapter 13, a selective scan.

Proposition 15.2 (Recovery and resolution of the Lyapunov estimator).

Let $\{J_t\}$ be the per-step Jacobian sequence of a recurrence, and let $\hat\lambda_1 \ge \cdots \ge \hat\lambda_d$ be the Benettin–QR estimate over a trace of length $T$ .

(Recovery.) If the system is autonomous with a diagonalizable transition $J$ , then $\hat\lambda_i \to \log|\lambda_i(J)|$ as $T \to \infty$ ; the estimate is a time average, so its error decays like $O(1/T)$ , and the top exponent (largest spectral gap) converges fastest.
(Divergence identity.) For any $\{J_t\}$ and every $T$ , $\sum_i \hat\lambda_i = \frac1T \sum_t \log|\det J_t|$ exactly, because the QR triangular factors satisfy $\prod_i |R^{(t)}_{ii}| = |\det J_t|$ . The sum of the exponents is the robust summary that survives even when the individual exponents do not.
(Resolution limit.) When several modes share a modulus in a non-normal recurrence, the QR frame cannot isolate the individual exponents — they carry $O(\varepsilon)$ splitting noise — though their mean stays pinned by part 2.

Part 1 is the multiplicative-ergodic / QR-convergence theorem specialized to a constant map; part 2 is exact each step (Exercise 15.5); part 3 is the subtlety that matters in practice, and it is not about degeneracy alone. The companion makes the distinction sharp. A diagonal-plus-rank-one transition from Chapter 13 (Proposition 13.1), symmetric with distinct moduli, is recovered to a measured $2.1\times10^{-4}$ over $T = 4000$ steps (top exponent to $2.3\times10^{-6}$ — fastest, as promised); a diagonal selective/LTV Jacobian stream from Chapter 9 is recovered exactly (the decoupled coordinates make the QR trivial). Even a system that resembles an S4D-Lin layer at initialization — every mode decaying at the same rate $\Re(\lambda) = -\tfrac12$ , a degenerate spectrum — is recovered exactly, because its modes are decoupled.

The resolution limit appears only when degenerate modes are coupled: Chapter 2’s damped ring, a non-normal recurrence whose modes share a decay rate but rotate at different frequencies, scatters the per-mode estimates by a measured $5.4\times10^{-3}$ while the mean rate stays exact to $1.4\times10^{-16}$ . The lesson for diagnosing a trained model is concrete: trust the top exponent and the sum; distrust an individual interior exponent of a coupled system unless its modulus is well separated.

Two panels. Left: Lyapunov exponent versus mode index. Three systems' QR estimates (filled markers) sit on their closed-form references (thin crosses): a six-mode DPLR transition spread between about minus 0.1 and minus 1.1, a five-mode selective/LTV system, and an S4D-Lin system whose eight exponents lie on a flat line at minus 0.2. Right: per-mode error (estimate minus reference) versus mode index for the non-normal ring; the errors scatter between about plus and minus 0.005 while a dotted line marks the mean error at minus 1.4e-16. — The Lyapunov estimator on constructed systems (§15.4). Left: the Benettin–QR estimate recovers the known spectrum of a Chapter 13 DPLR transition (max error $2.1\times10^{-4}$, top exponent $2.3\times10^{-6}$), a Chapter 9 selective/LTV stream (exact, diagonal), and an S4D-Lin-like degenerate-but-decoupled system (exact — uniform decay $-0.2$). Right: on Chapter 2's non-normal ring, degenerate moduli that are *coupled* cannot be resolved mode-by-mode — the per-mode error scatters by $5.4\times10^{-3}$ — yet the mean rate is exact ($1.4\times10^{-16}$), pinned by the divergence identity even where individual modes blur. Produced by companions/ch15/jax/lyapunov_diagnostics.py (reusing Chapter 2's qr_lyapunov); pinned in tests/test_lyapunov_diagnostics.py.

This is the diagnostic Chapter 7 anticipated — “the structure a stability or Lyapunov analysis of a trained S4 layer measures against” (Proposition 7.3) — and the one Chapter 2 named as the unified internal/input–output stability check, since an eigenvalue drifting toward $|\lambda| = 1$ shows up as a Lyapunov exponent crossing zero (Theorem 2.4). Running it on an actually trained network, where the Jacobian varies with the input and the modes are neither known nor decoupled, is pilot B’s program; the chapter validates the instrument, B points it at the data.

15.5 Regime detection and effective state size

The Lyapunov spectrum answers “is the state bounded?” The second instrument answers “how much of the state is actually doing memory?” A recurrence with $d$ coordinates may keep only a few of them near the marginal-stability boundary $|\lambda| = 1$ (persistent memory) while the rest contract quickly (transient). The count of persistent modes is the system’s effective state size, and it separates a memory regime from a forgetting regime.

Proposition 15.3 (Regime separation by effective state size).

Construct a transition with $r$ marginal modes ( $|\lambda| = 1$ ) and $d - r$ contractive modes ( $|\lambda| = w < 1$ ). Then:

(Two routes, one count.) The algebraic marginal count $\#\{i : |\lambda_i| \ge 1 - \delta\}$ (from the eigenvalues) and the dynamical marginal count $\#\{i : \hat\lambda_i \ge -\delta\}$ (from the Lyapunov spectrum) both equal $r$ for any slack $\delta$ separating the two levels. Independent computations — one algebraic, one dynamical — agree on the memory-mode count.
(Effective state size.) With $p_i = |\lambda_i|^2$ , the participation ratio $D_{\mathrm{eff}} \;=\; \frac{\bigl(\sum_i p_i\bigr)^2}{\sum_i p_i^2} \;=\; \frac{\bigl(r + (d-r)w^2\bigr)^2}{r + (d-r)w^4}$ is a continuous soft-count: $D_{\mathrm{eff}} \to r$ as $w \to 0$ and $D_{\mathrm{eff}} \to d$ as $w \to 1$ , interpolating between the marginal-mode count and the full dimension.

Part 1 is the anti-circularity guard that makes this a diagnostic rather than a tautology: an effective-dimension number “validated” by feeding it its own spectrum proves nothing, but two independent computations — the QR Lyapunov scan and the algebraic participation ratio — agreeing on a system whose $r$ is known by construction is real corroboration. The companion builds the $(r = 3, d = 8)$ system and confirms both routes return $3$ , with the measured participation ratio matching the closed form to $2.7\times10^{-15}$ ; the effective state size runs from $3.025$ at $w = 0.05$ (nearly all memory in the $r = 3$ marginal modes) to $7.980$ at $w = 0.95$ (all eight modes comparably persistent).

Two panels. Left: Lyapunov exponent versus mode index for a system with three marginal and five contractive modes; three points sit on the marginal line at zero and five sit well below at about minus 0.9, with a dashed line at lambda equals zero. Right: effective state size versus contractive magnitude w from 0.05 to 0.95; a measured curve from eigenvalues rises smoothly from 3 to 8, coinciding with a dotted closed-form curve, with horizontal reference lines at r equals 3 and d equals 8. — Effective state size as a regime diagnostic (§15.5). Left: the dynamical route — the Lyapunov spectrum of the constructed $(r = 3, d = 8)$ system places exactly $3$ exponents at zero (marginal/memory) and $5$ below (contractive). Right: the effective state size $D_{\mathrm{eff}}$, measured from the eigenvalues, matches its closed form to $2.7\times10^{-15}$ and interpolates from $r = 3$ (at $w = 0.05$, $D_{\mathrm{eff}} = 3.025$) to $d = 8$ (at $w = 0.95$, $D_{\mathrm{eff}} = 7.980$). The algebraic and dynamical routes agree on the memory-mode count — the cross-check that makes the diagnostic non-circular. Produced by companions/ch15/jax/lyapunov_diagnostics.py; pinned in tests/test_lyapunov_diagnostics.py.

Effective state size is the diagnostic complement to the capacity bound of §15.3: the bound says how many bits a $d$ -coordinate state could hold; the effective size says how many coordinates a particular system actually keeps alive. A model can satisfy the capacity bound with room to spare and still fail on long-range recall because, after training, its $D_{\mathrm{eff}}$ has collapsed to a handful of modes. Measuring that collapse on trained checkpoints — across training, across layers — is, again, B’s empirical program; an independent Julia implementation of the same QR-Lyapunov algorithm (matching the JAX spectra on shared anchors — the diagonal and DPLR closed-form values — to $<10^{-9}$ ) guards the numerics the program will lean on.

15.6 The mechanistic verdict

The impossibility theorems and the diagnostics meet in a mechanistic question: when a recurrent model underperforms a transformer on recall, what fails? The sharpest counter-evidence here comes from the architecture’s own side. Of Bick, Xing & Gu Bick et al. (2025) — the last author is Mamba’s creator — the Gather-and-Aggregate mechanism: a few attention heads do the heavy lifting of in-context retrieval (a “Gather” head extracts the relevant span, an “Aggregate” head integrates it), and recurrent models, lacking the sharp content-based addressing those heads provide, produce a smoother, lossier read. That the sharpest critique of recurrent recall comes from the architecture’s own side motivates the hybrids of Chapter 14 rather than abandoning the line. This is mechanistic evidence, not a theorem — we cite it, and offer Chapter 16’s outer-product reader as the analytic stand-in for the limitation, not a reproduction of the finding.

It also bears on why Chapter 14’s hybrids win: the recall gap that Lee et al. (2025) traces in Mamba–transformer hybrids is the same gap G&A localizes to a few heads. A handful of attention layers supply the sharp addressing the recurrence cannot, and the diagnostics of this chapter — Lyapunov spectrum, effective state size — are how you would find, in a trained hybrid, which layers carry the persistent memory and which have contracted to transients.

The assembled toolkit is three instruments: the capacity bound (an upper limit no fixed state escapes), the Lyapunov spectrum (is the state stable, and which modes persist), and the effective state size (how many modes actually do memory). Each is validated here against a system with a known answer. None is run on a trained network in this chapter — that boundary is deliberate. Probing trained SSMs and hybrids with these tools is pilot B’s contribution, and deciding what the prosecution’s file and the defense’s together imply for architecture choice is Chapter 17’s synthesis.

15.7 What’s next

Chapter 17 integrates the pilots: it takes the diagnostics built here and the benchmark protocol of Chapter 16 and turns them on the C1 (symplectic) and B (two-timescale) programs, weighing the counter-evidence of this chapter against the fourteen chapters of construction that precede it. The instruments of §§15.4–15.5 are exactly what B carries to trained checkpoints; the impossibility results of §15.2 are the ceiling against which Chapter 17 reads any empirical win. Looking back, this chapter closes the stability thread Chapter 13 opened: the matrix-memory spectrum that chapter computed at the architecture level (Proposition 13.1, Proposition 13.2) is now a quantity a diagnostic measures on a concrete, possibly trained, system.

15.8 Exercises

Exercise 15.1 (short)

A diagonal SSM carries $d = 64$ complex modes, each stored at $b = 16$ bits, over a vocabulary $|\Sigma| = 256$ . Use Proposition 15.1 to give the longest sequence $n^\star$ it can losslessly recall. Why is a real model’s usable length strictly smaller than $n^\star$ ?

Solution

The state budget is $d\,b = 64 \times 16 = 1024$ bits, and each token costs $\log_2|\Sigma| = \log_2 256 = 8$ bits, so $n^\star = \lfloor 1024 / 8 \rfloor = 128$ . A real model recalls fewer than $128$ tokens because the counting bound assumes a lossless, injective encoding of prefixes into states; a trained recurrence instead superposes writes (interference, the Chapter 11 rank wall), uses finite arithmetic, and never reaches the information-theoretic optimum. The bound is the ceiling of the ceiling — necessary, far from sufficient.

Exercise 15.2 (short, code)

Run companions/ch15/jax/lyapunov_diagnostics.py. Report the per-mode scatter and the mean-rate error for the non-normal ring. What do the two numbers together tell you about trusting a Lyapunov diagnostic on a trained (hence non-normal) recurrence?

Solution

The ring’s per-mode error scatters by $5.4\times10^{-3}$ while the mean rate is exact to $1.4\times10^{-16}$ . Together they say: on a coupled, non-normal recurrence with near-degenerate moduli, an individual interior Lyapunov exponent is unreliable (the QR frame cannot isolate it), but the sum/mean — pinned by the divergence identity $\sum_i \hat\lambda_i = \langle \log|\det J_t|\rangle$ — is trustworthy. On a trained model, read the top exponent (stability) and the sum (total contraction) confidently; treat an interior exponent as reliable only when its modulus is well separated from its neighbors.

Exercise 15.3 (short, code)

From the same run, report the two marginal-mode counts (algebraic and dynamical) and the effective state size at $w = 0.05$ and $w = 0.95$ for the $(r = 3, d = 8)$ system. Why does reporting both counts matter?

Solution

Both counts return $3$ ; the effective state size is $3.025$ at $w = 0.05$ and $7.980$ at $w = 0.95$ . Reporting both routes matters because they are independent: the algebraic count comes from the eigenvalues, the dynamical count from the QR Lyapunov scan. Their agreement on a system whose $r$ is known by construction is what certifies the diagnostic is measuring a real property rather than re-reporting its own input — the anti-circularity guard. A single route, agreeing only with itself, could be measuring an artifact.

Exercise 15.4 (theory) — solution in §15.9

Prove Proposition 15.1. Model the recurrence’s post-prefix state as a function $\sigma : \mathcal X \to \mathcal S$ , argue every later output factors through $\sigma$ , and apply the pigeonhole principle when $|\mathcal X| > |\mathcal S|$ . Then specialize to verbatim copy to obtain $d\,b \ge n\log_2|\Sigma|$ .

Exercise 15.5 (theory) — solution in §15.9

Prove part 2 of Proposition 15.2 (the divergence identity). Using one Benettin step $Q_{t} R_t = J_t Q_{t-1}$ with $Q$ orthogonal, show $\prod_i |R^{(t)}_{ii}| = |\det J_t|$ , and conclude $\sum_i \hat\lambda_i = \frac1T\sum_t \log|\det J_t|$ for every $T$ , independent of any mode-resolution.

Exercise 15.6 (theory) — solution in §15.9

Prove part 2 of Proposition 15.3. For the two-level spectrum ( $r$ modes at $|\lambda| = 1$ , $d - r$ at $|\lambda| = w$ ), derive $D_{\mathrm{eff}} = (r + (d-r)w^2)^2 / (r + (d-r)w^4)$ from the participation-ratio definition, and take the limits $w \to 0$ and $w \to 1$ .

15.9 Full solutions to theory exercises

Solution to Exercise 15.4

Fix the suffix that poses the queries; then for each prefix $x \in \mathcal X$ the model’s answers are a deterministic function of the state $s = \sigma(x)$ it holds after reading $x$ (the recurrence is deterministic, and everything downstream sees $x$ only through $s$ ). So the entire answer map factors as $x \mapsto \sigma(x) \mapsto \text{answers}$ . If $|\mathcal X| > |\mathcal S| \ge 2^{db}$ , then $\sigma$ cannot be injective (pigeonhole): there exist $x \ne x'$ with $\sigma(x) = \sigma(x')$ , so the model produces identical answers on $x$ and $x'$ . The task requires different answers on $x$ and $x'$ (they are in $\mathcal X$ , distinguished by construction), so the model is wrong on at least one. For verbatim copy, $\mathcal X$ is all $|\Sigma|^n$ length- $n$ strings (each must be reproduced, hence distinguished), so losslessness forces $2^{db} \ge |\Sigma|^n$ , i.e. $d\,b \ge n\log_2|\Sigma|$ . $\blacksquare$

Solution to Exercise 15.5

One Benettin step orthonormalizes the propagated frame: $J_t Q_{t-1} = Q_t R_t$ with $Q_{t-1}, Q_t$ orthogonal and $R_t$ upper-triangular with the sign convention $R^{(t)}_{ii} > 0$ . Take determinants: $\det(J_t)\det(Q_{t-1}) = \det(Q_t)\det(R_t)$ . Orthogonal matrices have $|\det Q| = 1$ , and $\det R_t = \prod_i R^{(t)}_{ii}$ , so

|\det J_t| \;=\; |\det R_t| \;=\; \prod_i |R^{(t)}_{ii}| .

The Benettin estimate is $\hat\lambda_i = \frac1T\sum_t \log|R^{(t)}_{ii}|$ , so summing over $i$ and exchanging the sums,

\sum_i \hat\lambda_i = \frac1T \sum_t \sum_i \log|R^{(t)}_{ii}| = \frac1T \sum_t \log\!\prod_i |R^{(t)}_{ii}| = \frac1T \sum_t \log|\det J_t| .

This holds for every $T$ and uses nothing about whether the individual $\hat\lambda_i$ have converged or whether modes are degenerate — the sum is exact even when the per-mode split is not. $\blacksquare$

Solution to Exercise 15.6

With $p_i = |\lambda_i|^2$ , the two-level spectrum has $r$ entries with $p_i = 1$ and $d - r$ entries with $p_i = w^2$ . Then

\sum_i p_i = r + (d-r)w^2, \qquad \sum_i p_i^2 = r\cdot 1 + (d-r)\,w^4 ,

so by definition

D_{\mathrm{eff}} = \frac{\bigl(\sum_i p_i\bigr)^2}{\sum_i p_i^2} = \frac{\bigl(r + (d-r)w^2\bigr)^2}{r + (d-r)w^4} .

As $w \to 0$ the contractive terms vanish: $D_{\mathrm{eff}} \to r^2 / r = r$ , the marginal-mode count. As $w \to 1$ every $p_i \to 1$ : $D_{\mathrm{eff}} \to d^2 / d = d$ , the full dimension. For $0 < w < 1$ it lies strictly between $r$ and $d$ , a continuous interpolation — the soft count of usable memory modes. $\blacksquare$

15.10 Companion code

The companions live in companions/ch15/{jax,torch,julia} and are float64 throughout; the JAX modules are canonical, the torch module mirrors the diagnostic core with cross-framework parity pinned $<10^{-9}$ , and the Julia module is an independent-language Lyapunov cross-check.

jax/copying_bound.py — the information-counting bound (Proposition 15.1): the state-capacity and lossless-requirement bits, the recall-cliff threshold, and the demonstration that Chapter 16’s slot model (imported, not re-implemented) cliffs exactly at $n^\star = d$ . Produces copying-bound.png.
jax/lyapunov_diagnostics.py — the Lyapunov estimator (Proposition 15.2) reusing Chapter 2’s qr_lyapunov, the constructed systems (DPLR from Chapter 13, selective/LTV from Chapter 9, an S4D-Lin-like degenerate system, and Chapter 2’s ring for the resolution limit), the divergence identity, and the regime/effective-state-size diagnostic with its two-route cross-check (Proposition 15.3). Produces lyapunov-validation.png and regime-separation.png.
torch/lyapunov_diagnostics.py — an eager mirror of the diagnostic core (QR Lyapunov, closed form, effective state size), pinned to the JAX outputs in torch/tests/test_ch15_torch.py; the instrument is framework-agnostic because pilot B runs it on trained torch models.
julia/lyapunov_crosscheck.jl — a stdlib-only independent QR-Lyapunov and eigendecomposition, matching the JAX spectra (diagonal recovery, the DPLR closed form) and the effective-state-size closed form to $<10^{-9}$ .

PYTHONPATH=. python companions/ch15/jax/copying_bound.py
PYTHONPATH=. python companions/ch15/jax/lyapunov_diagnostics.py
make companion-jax-tests          # all chapters' JAX suites
make companion-torch-tests        # JAX↔torch parity
julia --project=companions/ch15/julia companions/ch15/julia/runtests.jl

Part IV · Integration Week 16 Published

Empirical methodology: benchmark protocols and evaluation

Benchmarks as measurement instruments — the discriminative-regime principle for synthetic probes (tokenized MQAR with exact readers), length stress that pads with distractors rather than blanks (L90/AUC), the statistics of honest comparison (paired SEs, selection inflation), and the two-timescale protocol — composite predictor and probe signature — where pilot B's book-side prerequisites close.

Empirical methodology: benchmark protocols and evaluation

Chapter 16 — at a glance

Goal: turn the book’s claims into measurements. The architecture chapters kept ending at the same empirical question — Chapter 12 promised its lineage’s comparisons would be made honest here, and Chapter 14 handed over a benchmark seed and deferred its protocol: layer ratios, gate granularities, write rules, and hybrids judged on the same scale. This chapter builds the scale: what synthetic probes measure (and the narrow regime in which they measure it), how to stress long-range behaviour so that it actually stresses something, which statistics keep a comparison honest, and the two-timescale protocol that completes pilot B’s book-side toolkit.

Reading time: ~45 minutes prose; ~2 hours with the companions.

Key insight: a benchmark is a measurement instrument with a discriminative regime, and outside that regime it measures nothing. Two state sizes separate on associative recall only when the task load sits between them (an exact accuracy-gap formula in §16.2); a fading memory is stressed by fresher competing content, not by blank padding (an exact invariance in §16.4); and a real difference of $0.0145$ nats is decisive or invisible depending solely on whether the comparison is paired (§16.3). Designing the instrument — load, separation, pairing, and what gets reported — is as much a part of the science as the architectures themselves.

Looking ahead: §16.5 assembles the protocol around Chapter 14’s two-timescale task — the composite predictor it deferred here, and the probe-signature method behind pilot B’s disentanglement axis. With this chapter, B’s book-side prerequisites (Chapters 12, 14, 16) are complete.

16.1 The measurement problem

This book has spent fifteen chapters arguing that sequence architectures are points in one design space, not rival species: a selective SSM is a masked linear-attention computation (Theorem 9.5), an additive state holds only as many associations as its rank allows (Proposition 11.4), smarter write rules buy capacity at the price of a stability analysis (Theorem 12.4), and a hybrid is a budget split between exact local computation and compressed carried state (Proposition 14.2). Every one of those claims ends in an empirical question: given two points in the design space, how do you tell which is better, and for what? Chapter 12 promised that the comparisons across its lineage would be made honest here; Chapter 14 handed over a benchmark seed and deferred its protocol. This chapter pays both debts.

The premise is that a benchmark is a measurement instrument, and deserves the same scrutiny as any other instrument: what quantity does it respond to, over what operating range, with what noise floor? The lens chapters tell us what the quantities are. A finite state has a capacity — so there must be a probe that loads it past capacity. Real streams have timescales — so there must be a task whose slow variable a window cannot see and whose decay rate a state must match. Exactness has a range — so there must be a stress that separates content the model must retrieve from content it may forget. And every one of these costs resources — cache, state, throughput — that the comparison must hold fixed or report. None of this is hypothetical: the synthetic-probe methodology of MAD Poli et al. (2024) and Zoology Arora et al. (2024) exists precisely because small tasks chosen this way predicted which architectures were worth pretraining.

What the lens does not supply is automatic validity. A probe can be solvable by machinery unrelated to its advertised quantity; a length stress can fail to stress anything; an evaluation can launder noise into a ranking. Each failure mode appears below with an exact, companion-checked example — including one we built by accident and kept as teaching material.

The chapter is organized as an evaluation stack with four tiers, numbered cheap to expensive: Tier 1 — mechanism probes, synthetic tasks isolating one capability, minutes to run (§16.2); Tier 2 — synthetic long-context stress with controlled length and timescale dials (§§16.4–16.5); Tier 3 — natural long-document suites, realistic but noisy (§16.4); Tier 4 — language-model quality and systems efficiency, the expensive anchor (§16.6). The statistics of §16.3 apply at every tier, and matter most where evaluation is noisiest.

16.2 Synthetic probes and the discriminative regime

A mechanism probe earns its keep by isolating one capability so cleanly that failure has a unique explanation. The Tier-1 canon, by target capability:

associative recall — multi-query associative recall (MQAR): store key–value pairs, answer queries about them. The probe Zoology built, and the one Chapter 11 used to measure the capacity wall Arora et al. (2024) .
in-context pattern completion — the induction-head task: attend to the previous occurrence of the current token and copy its successor Olsson et al. (2022) .
selective copying — copy designated tokens while ignoring filler; the probe that motivated input-dependent selectivity Gu & Dao (2024) .
string copying at length — reproduce an input verbatim. Copying a length- $n$ string means carrying $n$ tokens of information, so a fixed $d$ -dimensional state must fail once $n$ outruns its capacity while the attention cache grows with the input by construction; Jelassi et al. (2024) prove the separation in its strongest form — a two-layer transformer can copy strings of exponential length while fixed-state models are fundamentally limited by their state — which is the empirical face of Chapter 3’s Krylov-dimension reading of the recurrence’s expressive ceiling, and redeems that chapter’s promise here. The impossibility side of the story (what else fixed state provably cannot do) is Chapter 15’s file.
composite suites — MAD’s six-task battery, used as a cheap screen whose scores rank-correlate with scaling behaviour Poli et al. (2024) .

Zoology’s central finding is that the language-model quality gap between efficient architectures and attention traces to associative-recall capability specifically — which is what made MQAR this tier’s anchor probe, and the reason the tier exists.

The companion builds the tokenized MQAR object end to end: episodes $[k_1, v_1, \dots, k_N, v_N, \text{(distractors)}, \text{(gap)}, q_1, \dots, q_N]$ with disjoint key/value/filler alphabets, distinct keys, and every target key queried exactly once; ground truth defined by an independent scan oracle; and a family of exact readers standing in for architectures, in the same no-training discipline as §14.6 — each reader is an information restriction, so every curve below is a computation rather than an optimization outcome. The induction reader (exact-match attention reading the successor token — the Olsson et al. (2022) circuit as an analytic object) answers correctly at every load. The outer-product reader (the Proposition 11.4 mechanism on token embeddings) degrades smoothly past its dimension. And the slot reader — a ring buffer keeping the last $d$ pairs, abstaining on evicted keys — is the exact-capacity idealization, with accuracy exactly $\min(1, d/N)$ at load $N$ : the caricature of a fixed-size state with a hard edge.

The slot reader is deliberately simple, because simplicity is what makes the section’s design principle a theorem rather than a heuristic:

Proposition 16.1 (The discriminative regime of a capacity probe).

Let two readers keep the last $d_1 < d_2$ pairs respectively, on episodes with $N$ distinct stored keys, each queried exactly once, evicted keys answered by abstention. Then each reader’s accuracy is exactly $\min(1, d_i/N)$ , so the accuracy gap

G(N) \;=\; \min\!\Bigl(1, \tfrac{d_2}{N}\Bigr) - \min\!\Bigl(1, \tfrac{d_1}{N}\Bigr)

satisfies: $G(N) = 0$ for $N \le d_1$ ; $G$ is increasing on $(d_1, d_2]$ , attaining its maximum $1 - d_1/d_2$ at $N = d_2$ ; and $G(N) = (d_2 - d_1)/N$ decays to zero for $N > d_2$ . The probe separates the two capacities only for loads near $(d_1, d_2]$ — sized below the smaller state or far above the larger one, it measures nothing about the comparison.

The proof is four lines (Exercise 16.1 walks an instance). Exactly $\min(N, d)$ of the $N$ queried keys remain in a last- $d$ buffer, and abstention makes accidental correctness impossible, giving the accuracy formula with equality; the three regimes of $G$ follow by cases on $N \lessgtr d_1, d_2$ , with monotonicity on $(d_1, d_2]$ because $1 - d_1/N$ increases in $N$ . The companion checks the accuracy identity at rtol=0 — not approximately, exactly — and the gap curve below peaks at $0.75 = 1 - 16/64$ at $N = 64$ , is identically zero through $N = 16$ , and has fallen to $0.0938$ by $N = 512$ .

Two stacked panels over stored pairs N on a log-2 axis from 4 to 512. Top: recall accuracy curves. The two slot-reader curves hold at 1.0 until their capacities (16 and 64) and then fall as d/N exactly; two dashed outer-product curves at the same dimensions degrade later and more smoothly; a dotted line at 1.0 marks the induction reader. Bottom: the accuracy gap between the d=64 and d=16 readers. The solid slot gap is zero through N=16, rises to its peak 0.75 exactly at N=64 inside a shaded band spanning 16 to 64, then decays; the dashed outer-product gap rises later, peaking near N=256. — The discriminative regime, exact and soft. Top: slot readers (solid) are exactly $\min(1, d/N)$; outer-product readers (dashed, mean of 8 seeds) hold past their dimension because argmax decoding tolerates sub-threshold interference, then degrade; the induction reader (dotted) is exact at every load. Bottom: the $d_2$-vs-$d_1$ accuracy gap. The slot gap is zero through $N = d_1 = 16$, peaks at exactly $1 - d_1/d_2 = 0.75$ at $N = d_2 = 64$ (shaded band: the discriminative regime of the proposition), and decays as $(d_2 - d_1)/N$ — $0.0938$ at $N = 512$. The soft readers' gap peaks later ($N = 256$) for the same structural reason, robust decoding shifting the effective capacities right. Produced by companions/ch16/jax/mqar.py; the exact identities and every quoted value are pinned in tests/test_mqar.py.

Two readings of the figure, one per direction of generalization. Soft readers shift, but do not escape, the principle. The outer-product reader’s gap peaks at $N = 256$ , not $64$ : argmax decoding over value embeddings tolerates interference below a threshold, so the effective capacity sits above the raw dimension — but the gap still vanishes at small loads and still decays at large ones, and only the knee’s location moved. Calibrate the load to the mechanism actually under test, not to the nominal state size. The principle is Chapter 14’s lesson in a second coordinate. There, a two-timescale task separates window from carried state only when $w \ll \tau_{\mathrm{id}} \ll 1/\varepsilon$ ; here, a recall task separates two capacities only when $d_1 < N \lesssim d_2$ . Both are instances of one rule: report the dial that places the task in its discriminative regime ( $\bar{\imath}$ there, $N$ here) as a first-class task parameter — a benchmark score without its dial setting is a number without units. The copying probe obeys the same rule: its dial is string length against state capacity, which is exactly why Jelassi et al. (2024) sweep length and why a copying score quoted at one length says little.

16.3 The statistics of honest comparison

Tier-1 probes are nearly noise-free — exact readers, exact oracles. Every tier above them is not, and two statistical failure modes account for a large share of unreproducible architecture comparisons. Both have exact antidotes, and both antidotes are cheap.

Pairing. Two predictors evaluated on the same items produce correlated scores: the items hard for one are mostly hard for the other. The honest uncertainty for a mean difference subtracts that shared difficulty; treating the two score sets as independent samples leaves it in.

Proposition 16.2 (Paired comparison removes shared item difficulty exactly).

Let $(a_i, b_i)_{i=1}^n$ be per-item scores of two predictors on the same $n$ items, with sample variances $\widehat{\sigma}_a^2, \widehat{\sigma}_b^2$ , sample covariance $\widehat{\sigma}_{ab}$ , and sample correlation $\rho$ . The standard error of the mean difference $\bar{a} - \bar{b}$ computed from the paired differences satisfies

\mathrm{SE}_{\text{paired}}^2 \;=\; \frac{\widehat{\sigma}_a^2 + \widehat{\sigma}_b^2 - 2\widehat{\sigma}_{ab}}{n} \;=\; \mathrm{SE}_{\text{unpaired}}^2 - \frac{2\widehat{\sigma}_{ab}}{n},

where $\mathrm{SE}_{\text{unpaired}}^2 = (\widehat{\sigma}_a^2 + \widehat{\sigma}_b^2)/n$ is the two-independent-samples formula. With equal variances the ratio is $\mathrm{SE}_{\text{paired}}^2 / \mathrm{SE}_{\text{unpaired}}^2 = 1 - \rho$ : at $\rho = 0.99$ , pairing shrinks the standard error tenfold; ignoring the pairing overstates it tenfold.

The proof is the variance of a difference of correlated means (Exercise 16.4). The companion’s demonstration uses the §16.5 reference stream: the window-8 and composite predictors differ by $0.0145$ nats of per-token log loss over $n = 8063$ shared positions, with per-item correlation $0.9867$ . The paired standard error is $0.00158$ ; the unpaired formula gives $0.01370$ — $8.66\times$ larger. The same measured difference is a $z = 9.2$ result with the correct statistic and an unpublishable $z = 1.1$ with the wrong one. Pairing is not a refinement; it decides whether the experiment worked.

Selection. The second failure mode is quieter: evaluate $k$ variants — seeds, sweeps, ablation cells — and report the best. If the variants are equally good, the reported number still exceeds the truth by the expected maximum of $k$ noise draws.

Proposition 16.3 (Best-of-k selection inflates by at most σ√(2 ln k), and roughly does).

Let $X_1, \dots, X_k$ be the evaluation-noise components of $k$ variant scores, each mean-zero and $\sigma$ -sub-Gaussian (in particular, $\mathcal{N}(0, \sigma^2)$ ). Then

\mathbb{E}\bigl[\max_{i \le k} X_i\bigr] \;\le\; \sigma \sqrt{2 \ln k}.

Reporting the best of $k$ equally good variants therefore overstates quality by up to $\sigma\sqrt{2 \ln k}$ — a bias that grows without bound in $k$ , shrinks only through $\sigma$ (more evaluation data), and vanishes only when the selection split is independent of the reporting split.

The proof is five lines with the moment generating function (Exercise 16.5). Measured on Gaussian draws, the expectation reaches $1.75\sigma$ at $k = 16$ and $2.34\sigma$ at $k = 64$ against bounds of $2.35\sigma$ and $2.88\sigma$ — the bound is conservative but the growth is real:

Inflation in sigma units versus number of variants k on a log-2 axis from 1 to 64. The measured expected-maximum curve rises from 0 through 0.57 at k=2 to 2.34 at k=64; the dashed bound curve sqrt(2 ln k) rises above it from 1.18 at k=2 to 2.88 at k=64. — Selection bias in isolation: the measured expected maximum of $k$ i.i.d. standard normal scores (2000 trials, seeded) against the $\sigma\sqrt{2\ln k}$ envelope. Measured values $1.7547\sigma$ at $k = 16$ and $2.3383\sigma$ at $k = 64$; the $k = 1$ point is $0$ within Monte-Carlo error. Produced by companions/ch16/jax/protocol.py; values pinned in tests/test_protocol.py.

And measured in the wild: scoring $16$ embedding seeds of the same outer-product reader (identical true accuracy by symmetry) on the same $64$ -episode battery, the best seed beats the across-seed mean by $+0.0051$ accuracy — against a predicted $1.75 \times 0.0028 = 0.0049$ from the measured seed spread $\sigma = 0.0028$ . The theory prices the malpractice to within a rounding error.

The protocol consequences fit in three rules. Pair everything pairable: evaluate all variants on identical items and report paired SEs (per-token losses pair naturally; per-episode accuracies pair across shared episodes). Freeze the suite before the sweep: a task set adjusted after seeing results is a selection mechanism, and inherits the $\sqrt{2 \ln k}$ tax without disclosing its $k$ . Report the sweep, not its maximum: best-of- $k$ is legitimate model selection exactly when the selected model is re-scored on data the selection never touched — the held-out discipline every probe in this chapter applies, down to the ridge probe of §16.5 fitting on the first half of a stream and scoring on the second.

16.4 Long-context evaluation and length-robustness metrics

Tier 2 asks the question the lens cares most about: does the mechanism hold at separations it was not tuned on? The field’s instruments here have a cautionary history worth one paragraph. Long Range Arena Tay et al. (2021) defined the long-context battlefield of 2020–2022 — six classification tasks at lengths $1$ – $16$ K, culminating in Path-X — and S4 famously broke it open, solving Path-X for the first time ( Gu et al. (2022) ; Chapter 8’s kernel machinery is why it could).

But LRA’s tasks are classification over long inputs, not recall, copying, or in-context use of long inputs — the capabilities that §16.2’s probes target and that language modeling actually exercises. As LTI SSMs saturated LRA while failing selective-copying and recall probes (the gap Chapter 9’s selectivity exists to close; Gu & Dao (2024) ), the field’s weight shifted to recall-centric instruments. The methodology lesson is not that LRA was a bad benchmark — it was a well-built instrument for its quantity — but that saturating an instrument retires it: past that point it ranks tie-breaking noise, and §16.3’s selection arithmetic takes over.

The current Tier-2/3 workhorses ask retrieval questions directly. Needle-in-a-haystack stress plants target content at controlled depths in long filler and queries it; RULER Hsieh et al. (2024) systematizes this with multi-key, multi-value, and aggregation variants at controlled lengths, and reports large gaps between claimed and effective context across model families — the single most protocol-shaped finding in the long-context literature: a context-window number on a model card is a claim about an instrument’s range, and RULER is the calibration. On the natural-document side (Tier 3), SCROLLS Shaham et al. (2022) and LongBench Bai et al. (2024) aggregate QA, summarization, and reasoning over real long documents — high validity, but noisy generative metrics (F1, ROUGE) at modest item counts, which is precisely where §16.3’s paired statistics earn their keep.

What should the synthetic version of length stress look like? The companion answers with a negative control we hit by accident. Take the fading-memory reader — the outer-product state decayed by $\rho$ per token, the cleanest caricature of a fixed-rate forgetting recurrence — and stress it by widening the key-to-query separation with neutral filler. Nothing happens. Exactly nothing: every stored weight rescales by the same $\rho^{g}$ , argmax decoding is scale-invariant, and the companion pins prediction-level equality between gap $0$ and gap $512$ at rtol=0. A NIAH-style stress built on blank padding measures nothing about fading memory. What stresses it is fresher competing content: $s$ distractor pairs written after the targets leave the target signal at $\rho^{2s + O(1)}$ against interference that decayed far less. RULER’s multi-key variants embody the same insight (the haystack is made of hay, not vacuum); the companion makes it exact, and the design rule is pad with content, not blanks.

Recall accuracy versus distractor-pair separation s on a log-2 axis from 1 to 1024, four curves. The rho=0.97 curve holds at 1.0 through s=8 then collapses by s=32; rho=0.99 collapses near s=64-128; rho=0.999 collapses past s=256; the no-decay rho=1 curve degrades latest, still 0.33 at s=1024. Dotted vertical lines mark each curve's L90. — Length stress that stresses something: recall accuracy of the fading-memory reader ($N = 8$ targets, dimension 64, mean of 8 seeds) as $s$ distractor pairs separate targets from queries. Faster decay fails sooner — $L_{90} = 16$, $32$, $128$ for $\rho = 0.97$, $0.99$, $0.999$ (AUC $0.4727$, $0.6234$, $0.8305$) — while the undecayed reader ($\rho = 1$, $L_{90} = 256$, AUC $0.9227$) eventually fails too, by capacity crosstalk rather than recency; the induction reader stays at accuracy $1.0$ at every separation. Neutral-filler padding, by contrast, leaves every one of these predictions exactly unchanged (the pinned negative control). Produced by companions/ch16/jax/mqar.py; L90/AUC values and the control pinned in tests/test_mqar.py.

The figure also separates two failure modes the single word “fails at length” conflates: the decayed readers fail by recency (the target faded), the undecayed one by capacity (the buffer filled) — same downward curves, different mechanisms, distinguishable because the generator controls $s$ and $\rho$ independently. That is the Tier-2 advantage over Tier 3 in one picture.

Curves invite single-number summaries, and the long-context literature’s single-number practice is worth formalizing. For a measured accuracy curve $\mathrm{acc}(s)$ over a strictly increasing separation grid $s_1 < \dots < s_m$ :

$L_{90} = \max\{\, s_j : \mathrm{acc}(s_j) \ge 0.9\, \mathrm{acc}(s_1) \,\}$ — the longest separation retaining $90\%$ of short-range accuracy (the range of the instrument, in the calibration sense);
$\mathrm{AUC} = \int \mathrm{acc}\; d(\log_2 s) \,\big/\, (\log_2 s_m - \log_2 s_1)$ — mean accuracy per octave of separation (trapezoid on the measured grid), rewarding robustness across scales rather than across raw tokens.

Both are implemented and exactness-tested in the companion (hand-constructed curves, rtol=0), and both inherit §16.2’s caveat: a single $L_{90}$ without its grid and task dials is a number without units — report the curve, quote the summary.

16.5 The two-timescale protocol

Chapter 14 shipped pilot B’s task: the two-timescale HMM — slow regime, sticky transition $T(\varepsilon)$ , regime-conditioned bigrams with overlap dial $\eta$ — with its exact restriction family, the matched-decay optimality theorem (Theorem 14.4), and the measured design constraint $w \ll \tau_{\mathrm{id}} \ll 1/\varepsilon$ (Figure 14.3). It deferred two things to the protocol: the composite predictor, and the probing method. Both are now in companions/ch16/jax/protocol.py, computed on the same reference instance as Chapter 14’s figures — same constants, same key derivation — so every number here is directly comparable with §14.6’s (the optimal filter’s $1.9289$ nats below is bit-for-bit Chapter 14’s).

The composite restriction. A window- $w$ filter restarted from a uniform prior wastes whatever the past knew; a fixed-decay filter carries the past but applies a possibly mistimed transition at every step. The composite is the natural hybrid idealization: exact Bayes updates (true transition) over the last $w$ tokens, seeded at the window edge with the $\lambda$ -decay filter’s posterior — the carried prior a fixed-decay recurrence could actually hand to an attention window. It comes with two machine-checked identities. With a uniform edge prior it is Chapter 14’s window filter (measured difference $0.0$ — two independently written implementations). And with $\lambda = \lambda^\star(\varepsilon)$ it reproduces the full Bayes filter at every window size (measured difference $2.2 \times 10^{-16}$ at $w = 8$ ): the matched carried prior is the true posterior, exact updates keep it true, and the window adds nothing because nothing was missing — Chapter 14’s matched-decay theorem composing exactly with windowed inference, the idealized version of the matched expansion of §14.2’s lens. Between the identities sits the measurement. At the reference operating point — the carry mistimed at $\lambda^\star(0.2)$ where the truth is $\varepsilon = 0.02$ , the same mistiming §14.6 measured — window-8 alone pays $0.0246$ excess nats, the mistimed decay alone pays $0.0512$ , and the composite pays $0.0101$ — beating its window part by $0.0145$ nats and its decay part by $0.0411$ . A hybrid of two individually mistimed resources beats both, because the window supplies exactness where the carry is stale and the carry supplies memory where the window is blind. That is §14.2’s division of labour as a single measured number.

Why probing is justified — and what it cannot show. Pilot B’s central axis asks where a trained hybrid stores the slow variable, and its method is per-layer probing. Before trusting probes, one should ask what near-optimal prediction actually guarantees. Notation for the answer: $B \in \mathbb{R}^{K \times V \times V}$ stacks the $K$ regime bigram tables over the vocabulary of size $V$ , so $B[:, a, :]$ is the $K \times V$ matrix whose $j$ -th row is regime $j$ ‘s next-token distribution given current token $a$ . For this task the guarantee is a proposition:

Proposition 16.4 (Near-Bayes prediction forces the regime prior into the predictive distribution).

On the two-timescale task, write $P^\star_t$ for the Bayes predictive distribution at position $t$ and $\pi_t$ for the Bayes regime prior, so that $P^\star_t = \pi_t^\top B[:, x_t, :]$ . Let a predictor output predictive distributions $Q_t$ , and let $\delta$ be its expected excess cross-entropy over Bayes — the population quantity that the sample excess plotted in every §14.6 figure estimates. Then:

(i) $\delta = \overline{\mathrm{KL}(P^\star_t \,\|\, Q_t)}$ exactly — expected excess cross-entropy is mean predictive divergence.

(ii) If $\sigma = \min_a \sigma_{\min}(B[:, a, :]) > 0$ (every per-token emission block has full row rank $K$ ), then the map $\pi \mapsto \pi^\top B[:, x_t, :]$ is injective with inverse Lipschitz constant $1/\sigma$ , and the regime prior $\hat{\pi}_t$ recovered from $Q_t$ by inverting it satisfies

\overline{\lVert \hat{\pi}_t - \pi_t \rVert_2} \;\le\; \frac{1}{\sigma}\, \overline{\lVert Q_t - P^\star_t \rVert_2} \;\le\; \frac{1}{\sigma}\, \sqrt{2\delta}.

A predictor within $\delta$ of Bayes must expose the regime prior, to average error $\sqrt{2\delta}/\sigma$ , in its output distribution alone.

Part (i) is one line (the cross-entropy decomposition; Exercise 16.6), and part (ii) is Pinsker plus norm comparison plus Jensen. The companion checks both numerically: the per-position identity behind (i) holds in closed form to $10^{-12}$ with the realized sample excess matching the mean divergence within Monte-Carlo error, and on the reference instance $\sigma = 0.0402 > 0$ , with the regime priors of two different restrictions recovered from their predictive distributions alone — both below $10^{-10}$ in the pinned test, the uniform-restart one to $1.99 \times 10^{-15}$ . The honest reading cuts both ways. The proposition justifies the probing program: the slow variable is provably present in any near-optimal predictor, so looking for it is not wishful. It does not say the information sits in any particular layer, that it is linearly decodable from internal activations, or that probe accuracy measures causal use — probing remains correlational, the inversion constant $1/\sigma$ can be large (here $\approx 25$ ), and locating the information in a trained network is exactly B’s empirical question, not a corollary. Prior art on trained hybrids — e.g. the memory-recall analyses of Lee et al. (2025) — operates under the same caveat.

The probe signature, demonstrated on idealized states. The protocol’s method: extract each candidate state, fit a closed-form ridge probe to the regime labels on the first half of the stream, report held-out argmax accuracy on the second half (the §16.3 discipline; deterministic end to end). Run on each restriction’s internal regime prior — its only carried state — the method produces the disentanglement signature pilot B will look for in trained hybrids:

Horizontal bar chart of held-out regime-probe accuracy for five idealized predictors. Full filter 0.839, composite 0.780, decay 0.691, window 0.717, unigram 0.548. A dotted vertical line marks chance at 0.25 and a dashed line marks the posterior-argmax ceiling at 0.849. — The probe signature on idealized states (reference instance, $w = 8$, mistimed $\lambda$; fit on the first half, scored on the second). The full filter's prior probes at $0.839$, essentially the posterior-argmax ceiling $0.849$; the composite recovers a large fraction of the carried-state gap ($0.780$ vs the uniform-restart window's $0.717$); the badly mistimed decay alone probes at $0.691$ — eight tokens of exact local inference beat a $10\times$-mistimed carry; the unigram tracker sits at $0.548$, well above chance $0.25$ but far from the ceiling. Produced by companions/ch16/jax/protocol.py; all five values pinned in tests/test_protocol.py.

Four measured readings. The full filter’s prior probes at $0.839$ , essentially at the ceiling — the exact posterior’s own argmax accuracy, $0.849$ , which sits below $1$ because even Bayes stays uncertain on this task at this overlap and switch rate. The composite’s carried prior adds $+0.062$ probe accuracy over the uniform restart — the slow variable demonstrably lives in the carry. The mistimed decay alone probes below the window ( $0.691$ vs $0.717$ ): a sufficiently mistimed memory is worth less than a short exact window even for tracking the slow variable itself — mistiming is not a discount, it is data loss, the probing face of §14.6’s V-shaped cost. And every restriction sits far above chance: this task’s regimes leak into everything, which is why the profile across states, not any single accuracy, is the signature.

The signature slots into pilot B’s reporting scheme of five axes — (A) local capability (short-context LM quality), (B) long-context retrieval, (C) compositional reasoning, (D) fast/slow disentanglement (the signature above — B’s contribution), and (E) systems efficiency — because §14.5’s production lineup showed that hybrids are sold on A+E and differentiated on B — while D is precisely the axis nobody yet measures, which is B’s premise — and a benchmark that collapses the axes into one score reproduces exactly the incommensurability this chapter exists to remove.

The chapter’s protocol contribution ends where B’s empirical work begins, and the boundary is worth stating exactly: everything above is computed on exact filters with known semantics — the protocol mechanics, validated in the only setting where ground truth is available. B applies the same mechanics — same task family, same paired statistics, same probe — to trained attention, Mamba, and hybrid stacks, layer by layer, where the answers are not known in advance. That program also has a theoretical anchor outside this book: the statistical-physics analysis of how memory in recurrent networks interacts with the temporal correlation structure of sequence tasks Seif et al. (2022) , the closest thing the disentanglement axis has to a first-principles prediction.

16.6 The assembled protocol

The pieces assemble into the four-tier stack previewed in §16.1 — ordered by cost, used in that order:

| Tier | What it isolates | Cost | Representative instruments | |---|---|---|---| | 1 — mechanism probes | one capability per task: recall capacity, induction, selective copying, copy length | minutes–hours, any GPU; near noise-free | MQAR Arora et al. (2024) , MAD Poli et al. (2024) , induction heads Olsson et al. (2022) , copying Jelassi et al. (2024) , §16.2’s exact readers | | 2 — synthetic long-context | mechanism behaviour vs controlled length/timescale dials; recency vs capacity failure | hours; dials reportable | RULER-style multi-key stress Hsieh et al. (2024) , §16.4’s distractor protocol, §16.5’s two-timescale protocol | | 3 — natural long documents | validity on real text; metric noise dominates | days; §16.3 statistics mandatory | SCROLLS Shaham et al. (2022) , LongBench Bai et al. (2024) | | 4 — LM quality + efficiency | the deployment quantities: perplexity/downstream suites; throughput, cache and state footprint | the expensive anchor | standard LM eval batteries; the cost accounting of Proposition 14.2 |

Three composition rules make it a protocol rather than a list. Iterate cheap, confirm expensive: Tiers 1–2 for design decisions (the MAD philosophy — synthetic scores predicted scaling well enough to design by Poli et al. (2024) ), Tiers 3–4 to confirm survivors; running the stack backwards spends GPU-weeks generating selection noise for §16.3 to price. Every tier reports its dials: load $N$ , separation $s$ , discrimination $\bar{\imath}$ , switch rate $\varepsilon$ — the discriminative-regime principle (Proposition 16.1) applied stack-wide, because a score without its dial is unfalsifiable about what was measured. Failures get a mechanism before a fix: Tier 2’s dials exist to separate recency from capacity from identification difficulty; a Tier-3/4 regression with no Tier-1/2 reproduction is an anecdote. Chapter 12’s honesty promise is discharged the same way: its lineage’s write rules differ exactly in recall capacity vs stability, so the comparison runs MQAR-family probes at matched state budgets (Proposition 11.4, Theorem 12.4) with paired statistics — not a leaderboard average.

The five-axis decomposition maps onto the stack as the reporting layer: axis A (local capability) reads from Tier 4’s quality suite, axis B (long-context retrieval) from Tiers 2–3, axis C (compositional reasoning) from Tier 3, axis D (disentanglement) from §16.5’s probe protocol, and axis E (efficiency) from Tier 4’s systems accounting. Nothing in this chapter required training a model — deliberately, in the same no-training discipline as the rest of the book: the protocol was validated where exactness is available, and its first trained-model consumer is pilot B.

16.7 What’s next

This chapter closed the book’s measurement debt: the discriminative-regime principle (Proposition 16.1), the comparison statistics (Proposition 16.2, Proposition 16.3), the distractor rule for length stress, and the two-timescale protocol with its recoverability guarantee (Proposition 16.4) — and with it, pilot B’s book-side prerequisites. Chapter 13 had already returned to architecture inside the gates — exponential gating and matrix memories (xLSTM, RWKV-7) extending Chapter 12’s lineage — and Chapter 15 filed the prosecution’s case: what SSM-heavy designs provably cannot do (the impossibility side of §16.2’s copying probe) and the diagnostics for catching it. The closing chapter, Chapter 17, integrates the pilots: what C1 and B actually used from these chapters, and what the book’s lens earned against their evidence.

16.8 Exercises

Three short problems (solutions inline) and three longer ones (solutions in §16.9).

Exercise 16.1 (short)

Two fixed-state readers keep $d_1 = 32$ and $d_2 = 128$ pairs. Using Proposition 16.1, find the load $N$ at which their accuracy gap peaks, the peak value, and the gap at $N = 1024$ . For what loads is the gap exactly zero?

Solution

The gap peaks at $N = d_2 = 128$ with value $1 - d_1/d_2 = 1 - 32/128 = 0.75$ . At $N = 1024 > d_2$ the gap is $(d_2 - d_1)/N = 96/1024 = 0.09375$ . The gap is exactly zero for all $N \le d_1 = 32$ (both readers are perfect), and only there — for $N > 32$ it is strictly positive, though negligible for $N \gg d_2$ .

Exercise 16.2 (short, code)

The §16.5 reference point measured excesses $0.0246$ (window-8), $0.0512$ (mistimed decay), $0.0101$ (composite, $w = 8$ ). Re-run the composite at $w = 16$ with the same mistimed $\lambda$ (composite_filter_predictions(hmm, tokens, 16, lam) in companions/ch16/jax/protocol.py). Before running: should its excess be above or below $0.0101$ ? Should the matched- $\lambda^\star$ identity still hold at $w = 16$ ?

Solution

Below: a longer window replaces more of the stale-carry region with exact inference, so the composite’s excess decreases toward $0$ as $w$ grows (at $w \ge \seqlen$ it is the full filter regardless of $\lambda$ ). And yes — the matched identity holds at every $w$ (the proposition-level argument of §16.5: true posterior in, true transition inside), which the test suite pins at $w \in \{1, 4, 8, 64\}$ . Running it measures an excess of $0.0029$ nats at $w = 16$ — between the $w = 8$ composite’s $0.0101$ and the full filter’s $0$ , as predicted, and pinned in tests/test_protocol.py.

Exercise 16.3 (short)

Two models score $\widehat{\sigma}_a = \widehat{\sigma}_b = 0.30$ nats of per-item loss spread on $n = 4000$ shared items, with correlation $\rho = 0.96$ , and mean difference $0.008$ nats. Compute both standard errors and both $z$ -scores. Which conclusion does each support?

Solution

Unpaired: $\mathrm{SE} = \sqrt{2 \times 0.09 / 4000} = 0.00671$ , so $z = 0.008/0.00671 \approx 1.19$ — no detectable difference. Paired: $\mathrm{SE} = \sqrt{2 \times 0.09 \times (1 - 0.96)/4000} = 0.00134$ , so $z \approx 5.96$ — decisive. Same data; the unpaired analysis simply throws away the fact that both models saw the same items.

Exercise 16.4 (theory) — solution in §16.9

(a) Prove Proposition 16.2: derive $\operatorname{Var}(\bar{a} - \bar{b})$ for i.i.d. paired draws $(a_i, b_i)$ with $\operatorname{Cov}(a_i, b_i) = \sigma_{ab}$ , show the paired/unpaired identity $\mathrm{SE}_{\text{paired}}^2 = \mathrm{SE}_{\text{unpaired}}^2 - 2\widehat{\sigma}_{ab}/n$ for the sample versions, and the $1 - \rho$ ratio under equal variances. When can pairing hurt? (b) Prove §16.4’s negative control: the decay reader’s predictions are exactly invariant to inserting $g$ neutral fillers between the stored pairs and the queries, for every $\rho \in (0, 1]$ — and identify precisely where the argument fails when the same $g$ positions are filled with distractor pairs instead.

Exercise 16.5 (theory) — solution in §16.9

Prove Proposition 16.3: for $X_1, \dots, X_k$ mean-zero $\sigma$ -sub-Gaussian (not necessarily independent), show $\mathbb{E}[\max_i X_i] \le \sigma\sqrt{2 \ln k}$ via the moment generating function. Where does the proof use sub-Gaussianity, and why is independence not needed?

Exercise 16.6 (theory) — solution in §16.9

Prove Proposition 16.4: (i) the excess-cross-entropy identity $\delta = \overline{\mathrm{KL}(P^\star_t \| Q_t)}$ , taking expectations under the true process; (ii) the recovery bound $\overline{\lVert \hat{\pi}_t - \pi_t \rVert_2} \le \sqrt{2\delta}/\sigma$ , via Pinsker’s inequality, the norm comparison $\lVert \cdot \rVert_2 \le \lVert \cdot \rVert_1$ , and Jensen. State precisely where full row rank of every $B[:, a, :]$ is used.

16.9 Full solutions to theory exercises

Solution to Exercise 16.4

(a) For i.i.d. pairs $(a_i, b_i)$ with variances $\sigma_a^2, \sigma_b^2$ and covariance $\sigma_{ab}$ , the differences $d_i = a_i - b_i$ are i.i.d. with $\operatorname{Var}(d_i) = \sigma_a^2 + \sigma_b^2 - 2\sigma_{ab}$ , so

\operatorname{Var}(\bar{a} - \bar{b}) = \operatorname{Var}\!\Bigl(\tfrac{1}{n}\sum_i d_i\Bigr) = \frac{\sigma_a^2 + \sigma_b^2 - 2\sigma_{ab}}{n}.

The sample version replaces each population moment by its ddof=1 estimator, and the bilinearity that gives $\widehat{\operatorname{Var}}(a - b) = \widehat{\sigma}_a^2 + \widehat{\sigma}_b^2 - 2\widehat{\sigma}_{ab}$ holds exactly for the sample moments as well (expand the centered sum of squares of $d_i$ ) — hence $\mathrm{SE}_{\text{paired}}^2 = \mathrm{SE}_{\text{unpaired}}^2 - 2\widehat{\sigma}_{ab}/n$ as an algebraic identity, which is the equality the companion’s test_paired_stats_exact_decomposition checks at $10^{-15}$ . With $\widehat{\sigma}_a = \widehat{\sigma}_b = \widehat{\sigma}$ and sample correlation $\rho = \widehat{\sigma}_{ab}/\widehat{\sigma}^2$ ,

\frac{\mathrm{SE}_{\text{paired}}^2}{\mathrm{SE}_{\text{unpaired}}^2} = \frac{2\widehat{\sigma}^2 - 2\rho\widehat{\sigma}^2}{2\widehat{\sigma}^2} = 1 - \rho.

Pairing hurts exactly when $\widehat{\sigma}_{ab} < 0$ : negatively correlated scores make the difference noisier than independence would. Negative item-level correlation between two sequence models on the same stream essentially does not occur — hard tokens are hard for both — which is why the rule in practice is unconditional.

(b) Insert $g$ fillers and consider a query at its shifted position $p + g$ . Every stored pair $i$ keeps its write position $t_i$ , so its decay weight changes from $\rho^{\,p - 1 - t_i}$ to $\rho^{\,p + g - 1 - t_i}$ : every summand of the state acquires the same factor $\rho^{g}$ , because the fillers write nothing. Hence $S_{p+g} = \rho^{g} S_p$ , the read-out vector becomes $\rho^{g}\,(S_p^\top \phi(q))$ , and $\operatorname{argmax}_v\, c\,u_v = \operatorname{argmax}_v\, u_v$ for any scalar $c > 0$ — predictions are exactly unchanged, for every $\rho \in (0, 1]$ (at $\rho = 1$ the factor is $1$ ). The companion pins this at rtol=0 between gap $0$ and gap $512$ . With distractor pairs the argument fails at exactly one step: the inserted positions now write. The state gains $g$ new summands whose decay exponents are small (they sit near the query), so the padded state is $\rho^{g} S_p + (\text{fresh terms})$ — no longer a positive scalar multiple of $S_p$ — and the argmax can flip once the fresh interference outruns the $\rho^{g}$ -attenuated target signal. Scale invariance is the reason blank padding measures nothing, and competing writes are the reason distractors measure something. $\blacksquare$

Solution to Exercise 16.5

For any $t > 0$ , by Jensen’s inequality applied to $\exp$ ,

\exp\bigl(t\,\mathbb{E}[\max_i X_i]\bigr) \;\le\; \mathbb{E}\bigl[e^{t \max_i X_i}\bigr] \;=\; \mathbb{E}\bigl[\max_i e^{t X_i}\bigr] \;\le\; \sum_{i=1}^k \mathbb{E}\bigl[e^{t X_i}\bigr] \;\le\; k\, e^{t^2 \sigma^2 / 2},

where the last step is the definition of $\sigma$ -sub-Gaussianity ( $\mathbb{E}[e^{tX}] \le e^{t^2\sigma^2/2}$ — for $\mathcal{N}(0, \sigma^2)$ this is the exact moment generating function). Taking logarithms,

\mathbb{E}[\max_i X_i] \;\le\; \frac{\ln k}{t} + \frac{t \sigma^2}{2},

and optimizing the right side at $t = \sqrt{2 \ln k}/\sigma$ gives $\mathbb{E}[\max_i X_i] \le \sigma\sqrt{2 \ln k}$ . Sub-Gaussianity enters only through the moment bound; independence is never used, because the union step $\max_i e^{tX_i} \le \sum_i e^{tX_i}$ holds pointwise for any joint distribution — which is what makes the proposition applicable to $k$ correlated ablation cells, not just clean seed sweeps. (For $k = 1$ the bound is $0$ , consistent with mean-zero noise.) $\blacksquare$

Solution to Exercise 16.6

(i) Condition on the prefix $x_{1:t}$ . Under the true process the next token is distributed as $P^\star_t$ (the Bayes predictive is the true conditional — the filter is exact for this model). Hence the conditional expected log losses are $\mathbb{E}[-\log Q_t(x_{t+1})] = H(P^\star_t) + \mathrm{KL}(P^\star_t \| Q_t)$ and $\mathbb{E}[-\log P^\star_t(x_{t+1})] = H(P^\star_t)$ ; subtracting and averaging over positions gives $\delta = \overline{\mathrm{KL}(P^\star_t \| Q_t)}$ . (The companion’s measured “excess CE” is the empirical average of realized log-ratios, whose conditional expectation is exactly this KL. The companion’s test_excess_ce_identity_closed_form checks both layers: the per-position identity between the closed-form expected excess and the closed-form KL at $10^{-12}$ , and the realized sample excess against the mean KL within Monte-Carlo error.)

(ii) Fix $t$ and write $E = B[:, x_t, :]^\top \in \mathbb{R}^{V \times K}$ , so predictive distributions are $P^\star_t = E\,\pi_t$ and the recovered prior solves $E\,\hat{\pi}_t = Q_t$ in least squares — that is, $E\hat{\pi}_t$ is the orthogonal projection of $Q_t$ onto $\operatorname{range}(E)$ ; since $P^\star_t = E\pi_t$ lies in that range and projection is a contraction, $\lVert E\hat{\pi}_t - P^\star_t \rVert_2 \le \lVert Q_t - P^\star_t \rVert_2$ . Full row rank of $B[:, x_t, :]$ means $E$ has full column rank $K$ with $\sigma_{\min}(E) \ge \sigma$ , so for any two priors, $\lVert E(\hat{\pi}_t - \pi_t) \rVert_2 \ge \sigma \lVert \hat{\pi}_t - \pi_t \rVert_2$ — this is the only place rank is used, and it is used at every token value $a$ , hence the minimum over $a$ in the definition of $\sigma$ . Therefore

\lVert \hat{\pi}_t - \pi_t \rVert_2 \;\le\; \frac{\lVert Q_t - P^\star_t \rVert_2}{\sigma} \;\le\; \frac{\lVert Q_t - P^\star_t \rVert_1}{\sigma} \;\le\; \frac{\sqrt{2\,\mathrm{KL}(P^\star_t \| Q_t)}}{\sigma},

using $\lVert v \rVert_2 \le \lVert v \rVert_1$ and Pinsker’s inequality $\lVert P - Q \rVert_1 \le \sqrt{2\,\mathrm{KL}(P \| Q)}$ . Averaging over positions and applying Jensen to the concave square root, $\overline{\sqrt{\mathrm{KL}_t}} \le \sqrt{\overline{\mathrm{KL}_t}} = \sqrt{\delta}$ , which gives $\overline{\lVert \hat{\pi}_t - \pi_t \rVert_2} \le \sqrt{2\delta}/\sigma$ . $\blacksquare$

16.10 Companion code

PYTHONPATH=. python companions/ch16/jax/mqar.py        # §§16.2, 16.4 + both figures
PYTHONPATH=. python companions/ch16/jax/protocol.py    # §§16.3, 16.5 + both figures
make companion-jax-tests                               # full JAX suite
make companion-torch-tests                             # torch parity (incl. ch16)

mqar.py — the tokenized MQAR object: generator (targets, distractors, neutral gap), independent scan oracle, the exact reader family (induction, outer-product, slot, fading-memory), the discriminative-regime and length-robustness experiments, and the L90/AUC metrics. The slot-reader capacity identity, the neutral-gap negative control, and every caption number are pinned in tests/test_mqar.py (55 tests).
protocol.py — the protocol toolkit on the ch14 task (imported, same reference instance): the composite predictor with its two exact identities, per-token paired statistics, selection-inflation experiments, the emission-inversion demo behind Proposition 16.4, and the ridge probe signature. Pinned in tests/test_protocol.py (34 tests).
companions/ch16/torch/ — eager mirrors of the reader scoring paths, the composite filter, and the ridge probe; integer decodes agree exactly and probability paths to $10^{-9}$ against JAX (9 parity tests).

JAX is the canonical implementation; figures are JAX-produced. There is no Julia track for this chapter — the numerical core is filtering linear algebra and counting, with no ODE/integrator content.

Part V · Synthesis Week 17 Published

Niche-pilot integration: from curriculum to research output

The crown-jewel chapter — how the 17-chapter curriculum composes into the two research pilots (C1 symplectic integrators, B two-timescale benchmarks) and the broader 13-niche portfolio, with the instruments of Chapters 6, 10, 14, 15, and 16 run end-to-end on idealized systems as the reproducible template the pilots fill with trained-model data.

Niche-pilot integration: from curriculum to research output

Chapter 17 — at a glance

Goal: close the book by showing the dynamical-systems lens is a research instrument. The sixteen preceding chapters built a vocabulary; this chapter composes it — the symplectic integrators of Chapter 6, the complex-mode SSM of Chapter 10, the two-timescale task of Chapter 14, the diagnostics of Chapter 15, the protocol of Chapter 16 — into the two research pilots the curriculum was designed around (C1, symplectic integrators; B, two-timescale benchmarks), and maps the broader 13-niche portfolio the pilots are drawn from.

Reading time: ~40 minutes prose; ~1.5 hours with the companions.

Key insight: the instruments compose, and their composition is the integration. Two worked demonstrations carry the chapter — a symplectic atlas cell (which integrator best preserves an SSM mode’s energy?) and an end-to-end disentanglement pipeline (does a predictor’s effective state size govern how well it separates fast from slow?). Each runs on the book’s idealized systems and produces a measured signature no single prior chapter produced. The empirical payoff — running these on trained checkpoints — is the pilots’ forthcoming program, so the chapter’s verdict on “what the lens earned” is deliberately provisional.

This is the final chapter. It does not look ahead to more curriculum; it looks outward, to the research the curriculum exists to enable. Where the work goes next (§17.6) is the pilots’ empirical execution, tracked in the predecessor repository.

17.1 What the dynamical-systems lens set out to earn

The book’s thesis has been a single sentence applied sixteen times: a sequence layer is a dynamical system, and you understand it by reading its state. A linear recurrence is a discretized ODE (Definition 1.1); its stability is the spectrum of its transition (Definition 2.2); its capacity is the rank of its state (Proposition 11.4); what it can losslessly recall is set by its state’s bit budget (Proposition 15.1). The claim was never that this lens is the only one — it is that it earns its keep as a research instrument, letting a numerical analyst ask questions about sequence models in their native vocabulary.

A thesis like that is proven by use, not assertion. The curriculum was designed around two research pilots that put the lens to work, and this chapter is where the book’s pieces compose into them:

C1 — symplectic and geometric integrators for SSMs (the primary pilot): does a trained SSM’s transition have enough Hamiltonian structure that a structure-preserving integrator would beat the exp-trapezoidal default of Mamba-3? This is the question only the geometric-integration lens can ask (Chapters 1–3, 6, 10).
B — synthetic two-timescale benchmarks (the parallel pilot): can an architecture cleanly disentangle a fast process from a slow one, and does its effective state size predict whether it can? This composes the singular-perturbation view Kokotović et al. (1986) (Chapter 14), the diagnostics (Chapter 15), and the protocol (Chapter 16).

Two boundaries govern everything that follows, and the chapter marks them everywhere. First, cited versus demonstrated: results we prove or measure are ours; the deep ceilings (the $\mathsf{TC}^0$ impossibility, Merrill et al. (2024) ) are cited. Second, and more load-bearing here, idealized versus trained: every demonstration in this chapter runs on a constructed system with a known answer — an oscillator mode, a known HMM. Turning these instruments on trained checkpoints is the pilots’ empirical program, in flight, not reported here. The chapter ships the reproducible template; the data is forthcoming.

17.2 C1 — symplectic integrators, and an atlas cell

Chapter 6 built the modified-Hamiltonian theory of symplectic integration Hairer et al. (2006) (Theorem 6.2): a symplectic step conserves a perturbed energy exactly, so its energy error stays in a bounded band forever, while a non-symplectic step of the same order lets energy drift secularly (Figure 6.2). Chapter 10 built Mamba-3’s complex-mode SSM Lahoti et al. (2026) and its exp-trapezoidal discretization (Theorem 10.1, Theorem 10.3). The C1 pilot’s opening question fuses them: if a trained SSM mode is near-conservative, is the exp-trapezoidal rule leaving structure on the table that a symplectic integrator would keep?

The bridge is exact. The harmonic oscillator $\dot q = p,\ \dot p = -q$ , written as a complex state $z = q + ip$ , obeys $\dot z = -iz$ — a Mamba-3 complex mode at the purely imaginary eigenvalue $\lambda = -i$ . A near-conservative SSM mode is the same mathematics on a new object, not an analogy — which is what makes C1 a direct transfer of the geometric-integration toolkit rather than a metaphor for it.

Its exact discrete transition is

e^{-i\,dt}

, with magnitude exactly

1

, so the diagonal SSM’s own integrator (the exact exponential) conserves the mode energy

H = \tfrac12|z|^2 = \tfrac12(q^2 + p^2)

by construction.

The atlas cell measures three integrators of this one mode, composing Chapter 6’s Verlet and RK4 steppers with Chapter 10’s exact complex recurrence:

Two panels. Left: energy error H_k minus H_0 over the first six periods; the Verlet curve oscillates within a band of about plus-or-minus 0.00125, while RK4 and the exact-exponential curves are nearly flat at this scale. Right: a log-log plot of the absolute secular drift per period versus step size dt from 0.05 to 0.4; the RK4 line is highest and rises steeply, the Verlet line is far below it and rises gently, and the exact-exponential line sits near machine precision. — The C1 atlas cell (§17.2): three integrators of a harmonic-oscillator SSM mode. Left — the bounded energy oscillation each carries over the first six periods; Verlet's symplectic band ($1.25\times10^{-3}$) is larger than RK4's, the trade-off it pays. Right — the secular (accumulating, least-squares) drift per period vs step size: RK4 drifts (secular slope $-4.36\times10^{-7}$ per period at $dt = 0.1$, whose endpoint drift reproduces Chapter 6's rk4_drift_per_period to $1.3\times10^{-11}$), Verlet's secular slope is $\sim 75\times$ smaller (a measured $5.8\times10^{-9}$, effectively zero — it carries only a bounded oscillation), and the exact-exponential diagonal SSM transition conserves energy to a measured $2.8\times10^{-12}$. Produced by companions/ch17/jax/c1_integration.py (composing ch06 + ch10); cross-checked in stdlib Julia and pinned in tests/test_c1_integration.py.

The reading is sharp, and it is the C1 pilot’s premise in miniature. A symplectic integrator (Verlet) buys zero secular energy drift at the cost of a larger bounded oscillation; a non-symplectic one (RK4) has a smaller oscillation but accumulates drift. But the diagonal SSM’s exact exponential dominates both — it has neither drift nor oscillation, because exponentiating a purely imaginary eigenvalue is exactly energy-preserving. The classical symplectic-versus-standard contest is moot for a diagonal SSM. This is not a null result for C1 — it sharpens the pilot’s target: the symplectic advantage can only reappear where the exponential is not applied exactly — a coupled or non-normal transition (the matrix memories of Proposition 13.1), or the forced integral handled at finite order. Whether a trained selective SSM drifts off the diagonal far enough for that to matter is exactly the empirical question the C1 pilot’s symplectic atlas pursues on real matrices — and which this idealized cell cannot answer, only frame. The pilot’s code and findings live in the predecessor repository’s C1 kickoff.

17.3 B — two-timescale benchmarks, and a disentanglement pipeline

The B pilot asks whether an architecture separates a fast process (token bigrams) from a slow one (regime drift). Chapter 14 built the two-timescale HMM and its matched-decay optimality (Theorem 14.4, Theorem 14.1); Chapter 16 built the probing protocol and the paired-comparison statistics (Proposition 16.1, Proposition 16.2); Chapter 15 built the diagnostics, including effective state size (Proposition 15.3). Each measured one facet. The integration runs all three end-to-end on the same reference instance and asks whether the facets cohere.

For each idealized predictor on the known HMM, the pipeline computes three numbers: the effective state size of its regime-propagation operator (Chapter 15, the participation ratio of the operator’s spectrum), its regime-recovery probe accuracy (Chapter 16, a held-out ridge probe of the true regime from the predictor’s carried state), and its predictive cross-entropy (Chapter 14).

Two panels. Left: regime-probe accuracy versus effective state size for three predictors — unigram at about (1.0, 0.55), decay at about (3.66, 0.69), and full at about (4.0, 0.84) — rising monotonically. Right: a bar chart of cross-entropy in nats for the three predictors, with full lowest at 1.93 (a dotted Bayes-floor line), decay at 1.98, and unigram highest at 2.43. — The B disentanglement pipeline (§17.3) on the Chapter 14 reference HMM ($K = 4$ regimes). Left — regime-recovery probe accuracy rises monotonically with the predictor's effective state size: unigram (collapsed state, $D_{\mathrm{eff}} = 1.00$, probe $0.548$), decay ($D_{\mathrm{eff}} = 3.66$, probe $0.691$), full Bayes filter ($D_{\mathrm{eff}} = 4.00$, probe $0.839$). Right — predictive loss falls in the same order ($2.43 \to 1.98 \to 1.93$ nats, the Bayes floor). The probe accuracies are exactly Chapter 16's probe_signature; the effective state sizes are Chapter 15's; the loss is Chapter 14's. Produced by companions/ch17/jax/b_integration.py; pinned in tests/test_b_integration.py.

The three instruments cohere: a predictor’s effective state size, its disentanglement, and its loss move together.

The comparison is made honest by Chapter 16’s paired statistics: the decay-versus-full per-token loss gap is a measured

0.0512

nats, with a paired standard error of

0.0033

against an unpaired

0.0135

— the shared per-token difficulty (correlation

0.94

) is subtracted off, resolving a gap that an unpaired comparison would call marginal. This is the protocol of Proposition 16.2 turned on the pilot’s own task.

Every predictor here is an exact idealization on a known HMM — there is no trained model and no fitted probe target beyond the closed-form filters. The pilot’s contribution is to run this same pipeline on trained checkpoints: probe a real layer’s regime recovery against its measured effective state size, and read its disentanglement against the impossibility ceiling (Proposition 15.1). That program — generators, baselines, the harness — lives in the B kickoff.

17.4 The 13-niche portfolio and the decision rubric

C1 and B are two cells of a larger catalog. The curriculum was designed against a 13-niche portfolio — research directions a numerical-analyst’s reading of sequence models opens up — grouped into four families. The table below maps each niche to representative chapters it draws on as shipped (the author’s reading of the finished book — not the design-time dependency matrix, which together with the full taxonomy and per-niche kickoffs lives in the predecessor’s curriculum design and niche decision).

| Family | Niche | One line | Draws on | Status | |---|---|---|---|---| | A (diagnostics) | A1 Discretization atlas | catalog integrators against SSMs and step sizes | Ch 4–6, 10 | portfolio | | | A2 Update-rules-as-integrators | delta-rule updates as ODE solvers | Ch 6, 12 | portfolio | | | A3 Diagnostics toolkit | Lyapunov spectra + effective state size on trained models | Ch 2, 15 | portfolio | | | A4 Multi-signal regime detection | combine signals to localize regime boundaries | Ch 9, 15 | portfolio | | | A5 Gather-and-Aggregate | the head-importance mechanism behind the recall gap | Ch 14, 15 | portfolio | | B (benchmarks) | B Two-timescale benchmarks | disentangle fast from slow; the §17.3 pipeline | Ch 12, 14, 15, 16 | active (parallel pilot) | | C (NA core) | C1 Symplectic integrators | structure-preserving discretization; the §17.2 atlas | Ch 1–3, 6, 10 | active (primary pilot) | | | C2 Stiffness analysis | is the stiff regime the real training issue? | Ch 5, 6 | portfolio | | | C3 Krylov / matrix-free | the SSD scan as a Krylov target | Ch 3, 9 | portfolio | | | C4 HiPPO conditioning | conditioning of the HiPPO construction | Ch 3, 7 | portfolio | | D (theory) | D1 Reachability | control-theoretic reachability of SSM states | Ch 1, 2 | portfolio | | | D2 Spectral biographies | eigenvalue trajectories across training | Ch 2, 13, 15 | portfolio | | | D3 ODE-solver-as-architecture | architectures named by their integrator | Ch 5, 6, 10 | portfolio |

The two active pilots were not chosen by enumerating value; they were chosen by a decision rubric worth stating, because it generalizes. (i) Direct-transfer depth — C1 is the same symplectic mathematics the author already knows, on a new object, not a loose analogy. (ii) A shared publishing window — A1, C1, and C2 all bear on the exp-trapezoidal integrator and share the Mamba-3 window, so picking any one keeps the family alive. (iii) Non-overlapping machinery — C1 (geometric integration) and B (singular perturbation) use disjoint tools, so they run in parallel without competing for the same hours. (iv) Paid-up-front cost and a documented-null fallback — C1 ships a structural-classification result even if its empirical case returns negative.

The other eleven niches stay in the portfolio as documented alternatives, each with the chapters it would draw on already in hand.

17.5 What the lens earned, and its limits

So: did the dynamical-systems lens earn its keep? The honest answer is provisionally yes, and here is exactly how far the evidence reaches.

What it earned is the questions, posed precisely and instrumented. “Is exp-trapezoidal leaving Hamiltonian structure on the table?” is not a question the architecture literature asks; it is natural only once an SSM mode is a discretized ODE and the integrator is a choice with a conservation budget (§17.2). “Does effective state size govern disentanglement?” is measurable only once the state has a spectrum and a participation ratio (§17.3). The lens turned vague intuitions — “Mamba is unstable at long horizons”, “hybrids recall better” — into quantities with diagnostics: Lyapunov exponents (Proposition 15.2), effective state size, the probe signature.

What it has not earned, and cannot from this chapter, is an empirical verdict. Every number here is from an idealized system; the trained-model evidence is the pilots’ forthcoming work. And two limits bound even that future evidence. The first is the cited ceiling: the $\mathsf{TC}^0$ circuit-class impossibility ( Merrill et al. (2024) ; the book’s own counting bound, Proposition 15.1, is its weaker, architecture-agnostic shadow) means no amount of clever discretization or benchmarking lets a fixed-state recurrence do what the complexity class forbids — the pilots can measure where a model sits beneath the ceiling, never move it. The second is mechanistic: the Gather-and-Aggregate finding ( Bick et al. (2025) ) says the recall gap lives in a few attention heads, so a pure-SSM pilot result is read against a known mechanism, not in a vacuum. The lens is an instrument for locating a model in a landscape whose boundaries other tools drew; that is a real contribution, and a bounded one.

17.6 Where this goes

This is the last chapter, so “what’s next” is not more curriculum — it is the research the curriculum exists to enable. The two pilots execute on their own timelines: C1 builds the symplectic atlas on trained Mamba-3 matrices and negative-eigenvalue SSM extensions; B runs the §17.3 pipeline on trained hybrids and reports the disentanglement signature against the ceiling. Both are tracked in the predecessor repository, and both have a documented-null fallback so they produce an artifact either way. The other eleven niches wait in the portfolio with their prerequisites in hand.

With this chapter the book is content-complete: seventeen chapters from linear ODEs to a research program, every architecture read as a dynamical system. The remaining work is not authoring but upkeep — the ordinary maintenance of a living book, its open items tracked in the repository’s issues.

17.7 Exercises

Exercise 17.1 (short)

A constructed SSM mode has eigenvalue $\lambda = -\gamma + i\omega$ with $\gamma > 0$ small. Using §17.2, which integrator — exact-exponential, Verlet, or RK4 — best preserves the mode’s long-horizon energy decay rate, and why does the answer change if the mode is embedded in a coupled (non-diagonalizable) transition?

Solution

For the diagonal mode the exact exponential wins outright: its discrete transition is $e^{\lambda dt}$ with magnitude $e^{-\gamma dt}$ , reproducing the true decay rate exactly at any step size, with no oscillation and no secular drift (the companion measures the conservative limit $\gamma = 0$ at a band of $2.8\times10^{-12}$ ). Verlet and RK4 only approximate $e^{\lambda dt}$ and carry the oscillation/drift trade-off of §17.2. The answer changes for a coupled transition: when the state is not diagonalized (a non-normal matrix memory, Proposition 13.1), the exponential is not applied mode-by-mode and is no longer exact or cheap, and a structure-preserving integrator can beat a generic one — which is precisely the regime the C1 pilot tests on trained matrices.

Exercise 17.2 (short, code)

Run companions/ch17/jax/b_integration.py. Report the effective state size, probe accuracy, and cross-entropy for the full / decay / unigram predictors, and the paired decay-vs-full standard error. What does $\mathrm{SE}_{\text{paired}} \ll \mathrm{SE}_{\text{unpaired}}$ tell you here?

Solution

The readout is full $(D_{\mathrm{eff}} = 3.998,\ \text{probe } 0.839,\ \text{CE } 1.929)$ , decay $(3.657,\ 0.691,\ 1.980)$ , unigram $(1.000,\ 0.548,\ 2.425)$ — probe accuracy and (negative) loss both monotone in effective state size. The paired comparison gives a mean gap $0.0512$ nats with $\mathrm{SE}_{\text{paired}} = 0.0033$ against $\mathrm{SE}_{\text{unpaired}} = 0.0135$ . Because the two predictors are scored on the same tokens, their per-token losses are highly correlated ( $0.94$ ); pairing subtracts that shared difficulty (Proposition 16.2), so a gap that is only $\sim 4\sigma$ unpaired is $\sim 15\sigma$ paired. The lesson: report paired statistics when comparing predictors on a shared task, or you will under-resolve real differences.

Exercise 17.3 (short)

Pick a portfolio niche from §17.4 — say A3 (diagnostics toolkit) or D2 (spectral biographies). List the chapters it draws on and name the specific companion instruments you would compose to bootstrap it.

Solution

A3 (diagnostics toolkit) draws on Chapters 2 and 15: compose companions/ch15/jax/lyapunov_diagnostics.py (lyapunov_spectrum, effective_state_size, the resolution-limit caveat) with Chapter 2’s qr_lyapunov engine, and apply them to trained-model Jacobians. D2 (spectral biographies) draws on Chapters 2, 13, and 15: track transition_spectrum (Chapter 13) and the effective state size across training checkpoints, watching the eigenvalues’ trajectory — the §17.3 pipeline is the per-checkpoint measurement, repeated over training. Both are the same move: take an idealized instrument validated in the book and point it at trained data, the niche supplying the experimental design.

Exercise 17.4 (theory) — solution in §17.8

Make the §17.2 bridge precise. Show that the harmonic oscillator $\dot q = p,\ \dot p = -q$ in the complex coordinate $z = q + ip$ satisfies $\dot z = -iz$ , that the exact discrete transition over a step $dt$ has magnitude $1$ , and conclude that a diagonal SSM with a purely imaginary eigenvalue conserves the mode energy under the exact exponential. Then argue when a symplectic integrator is nonetheless needed.

Exercise 17.5 (theory) — solution in §17.8

Design a two-timescale evaluation for a given pair of architectures, respecting Chapter 16’s discipline: size the task to the discriminative regime (Proposition 16.1), use the distractor rule, and use paired statistics. State precisely what measured outcome would count as a resolved disentanglement difference between the two architectures, and why an unpaired comparison could miss it.

Exercise 17.6 (theory) — solution in §17.8

Argue from the capacity bound (Proposition 15.1) what the C1 and B pilots can and cannot hope to demonstrate. Specifically: can either pilot’s empirical win raise a fixed-state model’s expressive ceiling, or only locate the model beneath it? Frame the answer in terms of the cited-versus-demonstrated boundary of §17.1.

17.8 Full solutions to theory exercises

Solution to Exercise 17.4

Write $z = q + ip$ . Then $\dot z = \dot q + i\dot p = p + i(-q) = p - iq$ . Factor: $-iz = -i(q + ip) = -iq + p = p - iq$ , so $\dot z = -iz$ exactly. The continuous solution is $z(t) = e^{-it}z(0)$ , and the exact discrete transition over a step $dt$ is multiplication by $e^{-i\,dt}$ , whose magnitude is $|e^{-i\,dt}| = 1$ . The mode energy is $H = \tfrac12|z|^2 = \tfrac12(q^2 + p^2)$ , and since $|z_k| = |e^{-i\,dt}|^k |z_0| = |z_0|$ , the energy is constant for all $k$ — conserved exactly, with no oscillation and no drift (the companion measures a band of $2.8\times10^{-12}$ , i.e. machine precision). A diagonal SSM applies $e^{\lambda dt}$ mode-by-mode, so it inherits this exactness on every imaginary mode; the symplectic machinery of Chapter 6 buys nothing it does not already have. A symplectic integrator is needed only when the exponential is not applied exactly: a non-normal or coupled transition (where diagonalization is unavailable or ill-conditioned, Proposition 13.1), or the forced/input integral handled at finite order (Theorem 10.1), where a structure-preserving step can keep the conservation budget a generic step spends. $\blacksquare$

Solution to Exercise 17.5

Size the task so the load sits in the discriminative regime of Proposition 16.1: the two architectures must differ in the capacity the task probes, or the benchmark measures nothing about them (a knee-region requirement, the §16.2 lesson). Construct the separation with the distractor rule — pad with content (fresher competing writes), not neutral fillers, since neutral padding is provably inert under an argmax read-out. Score each architecture’s per-item predictive loss (or probe accuracy) on the same sampled sequences, and compute the paired statistics of Proposition 16.2. A resolved difference is one whose paired confidence interval $\text{mean diff} \pm z\,\mathrm{SE}_{\text{paired}}$ excludes zero. An unpaired comparison can miss it because the per-sequence difficulty is shared (both architectures find the same sequences hard), inflating $\mathrm{SE}_{\text{unpaired}}$ by the shared-difficulty variance that pairing removes — as in §17.3, where the same gap is marginal unpaired and decisive paired. One must also pre-register the suite and avoid post-hoc subset selection, or the selection inflation of Proposition 16.3 manufactures a difference. $\blacksquare$

Solution to Exercise 17.6

Neither pilot can raise the ceiling. The capacity bound (Proposition 15.1) and the cited $\mathsf{TC}^0$ impossibility ( Merrill et al. (2024) ) are statements about every fixed-state recurrence: they bound what such a model can compute regardless of how it is discretized (C1) or benchmarked (B). A better integrator changes the numerical fidelity of the discretization, not the model’s expressive class; a better benchmark changes what we can measure, not what the model can do. So a C1 empirical win means “a structure-preserving integrator tracks this trained mode’s dynamics with less drift”, and a B win means “this architecture’s effective state size lets it disentangle to this measured degree” — both locate the model beneath the ceiling, with higher resolution than before. That is exactly the cited-versus-demonstrated boundary of §17.1: the ceiling is cited (a theorem of others, about a complexity class), and the pilots demonstrate position beneath it (measurements, on trained models). Confusing the two — claiming an integrator or a benchmark “breaks” an impossibility result — is the error the boundary exists to prevent. $\blacksquare$

17.9 Companion code

The companions live in companions/ch17/{jax,julia} and are float64 throughout; they compose shipped instruments rather than introduce new kernels, so there is no torch port (it would only re-run the JAX modules). Each produces a new integrated signature; the component values reduce to the originating chapters’ (pinned in the tests).

jax/c1_integration.py — the C1 atlas cell (§17.2). Composes Chapter 6’s symplectic/RK4 steppers and Chapter 10’s complex-mode recurrence to compare the long-horizon energy conservation of the exact-exponential, Verlet, and RK4 integrators on a harmonic-oscillator SSM mode. The reused RK4 path reproduces Chapter 6’s rk4_drift_per_period exactly. Produces c1-atlas-cell.png.
jax/b_integration.py — the B disentanglement pipeline (§17.3). Composes Chapter 14’s HMM and idealized predictors, Chapter 16’s probe signature and paired comparison, and Chapter 15’s effective state size into one readout linking effective state size, probe accuracy, and loss across the predictor family. Produces b-disentanglement.png.
julia/symplectic_crosscheck.jl — a stdlib-only, independent-language cross-check of the C1 atlas cell (the C1 pilot’s atlas is itself Julia), matching the JAX energy-conservation signatures to $< 10^{-9}$ .

PYTHONPATH=. python companions/ch17/jax/c1_integration.py
PYTHONPATH=. python companions/ch17/jax/b_integration.py
make companion-jax-tests          # all chapters' JAX suites
julia --project=companions/ch17/julia companions/ch17/julia/runtests.jl