Linear ODEs as state-space systems
Continuous-time linear systems, the matrix exponential as fundamental solution, the damped oscillator as canonical example, and the eigenvalue classification of phase portraits.
Linear ODEs as state-space systems
1.1 What is a state-space system?
A linear state-space system is the pair of equations
where is the state at time , is the input signal, is the output, and the four matrices are real-valued with shapes , , , . The first equation is the state equation — a first-order linear ODE describing how the state evolves. The second is the output equation — a static linear readout. The integer is the state dimension; it controls how much history the model can carry forward.
The pair is “linear” because both equations are linear in and separately. It is “state-space” because summarizes everything the system needs from the past in order to predict the future given future inputs — a property called the Markov property of state. Given the present state and the input from now onward, the system’s entire future is determined.
A few conventions before going further. We treat as a column vector and apply on the left: , not . We absorb the feedthrough term into the output equation only — it appears nowhere in the state dynamics — and we will often set in examples to keep notation light. When the input is absent (), the state equation reduces to
the homogeneous linear ODE, whose solution is the focus of §1.2.
Why this object
Three reasons the state-space form is the right starting point for a book on sequence-model architectures:
- It separates dynamics from input. Whatever the input signal looks like, the dynamics matrix alone determines the system’s long-run behavior — its stability, oscillation frequencies, and decay rates. Most of the analysis in Chapter 2 looks only at .
- It generalizes immediately. Replacing the scalar derivative with a finite difference yields a discrete recurrence (Chapter 4); replacing with admits oscillatory modes (Chapter 6, Chapter 10); replacing the constant with an input-dependent yields the selective-SSM family (Chapter 9). Each generalization preserves the state-space skeleton.
- It’s the form in which continuous-time models are discretized. Every SSM derivation in the literature starts from a continuous state equation and applies a numerical integration scheme to get a discrete recurrence. If you don’t know the continuous object, the discrete one is opaque.
1.2 The matrix exponential and fundamental solutions
The homogeneous ODE with initial condition has a unique solution for every — and that solution is
where is the matrix exponential of , defined by the same series as the scalar exponential:
For any square matrix , the matrix exponential is
The series converges absolutely for every (the entrywise -norm of the partial sums is bounded by ), so is always well-defined.
The fundamental solution property — that solves the homogeneous ODE — is verified by differentiating the series term-by-term and using . The resulting derivative is , which matches . Existence and uniqueness for the inhomogeneous case () follow the variation of parameters formula
an integral that recurs (in discretized form) as the convolution kernel of every LTI SSM in Chapter 8.
Eigenvalue structure determines everything
The matrix exponential’s behavior is governed entirely by the eigenvalues of . If is diagonalizable, with , then
So acts on the eigenbasis as independent scalar exponentials. The real parts of control decay or growth: means ; means blow-up. The imaginary parts control oscillation frequency. Three regimes recur throughout the book:
- All eigenvalues with negative real part — the system is asymptotically stable: every trajectory as . This is the regime in which a recurrence “forgets” its initial state, and it is the design target for the matrices of S4, Mamba, and friends.
- At least one eigenvalue with positive real part — the system is unstable: some initial conditions blow up. A trained SSM whose drifts into this regime is the dominant numerical failure mode (Chapter 5).
- Purely imaginary eigenvalues — the system is marginally stable and oscillates without decay. These are the energy-preserving modes of Hamiltonian systems and are the natural home of symplectic discretization (Chapter 6).
When is not diagonalizable — when it has repeated eigenvalues with deficient eigenspaces — the matrix exponential picks up polynomial-in- factors on top of the terms. The Jordan normal form (Chapter 3, §3.1) gives the precise statement. In practice almost every arising from random initialization or from physical systems is diagonalizable, and the polynomial-factor case shows up only at codimension-one boundaries in parameter space.
1.3 The damped harmonic oscillator
The canonical small example. A unit mass on a spring with stiffness and damping coefficient , displacement , evolves according to Newton’s second law,
This is a second-order scalar ODE, not first-order, so it does not yet fit the state-space template. We lift it by treating velocity as an extra state coordinate: let . Then with
This lifting trick — introduce auxiliary states to reduce ODE order to one — is the same procedure by which every higher-order ODE becomes a first-order state-space system, and it foreshadows how implicit higher-order schemes (Chapter 6) work internally.
The eigenvalues of are the roots of the characteristic polynomial :
Three sub-cases follow from the sign of the discriminant :
- Overdamped (): two real negative eigenvalues. Both modes decay monotonically; the displacement approaches zero without oscillation.
- Critically damped (): a repeated negative eigenvalue . The matrix is in this case not diagonalizable, and the solution contains a term.
- Underdamped (): a complex-conjugate eigenpair , with . Trajectories spiral toward the origin: oscillation at angular frequency with envelope decaying as .
The total energy is a useful diagnostic. Its time derivative is , i.e. energy decreases monotonically whenever . Both stationary and oscillating trajectories satisfy this. The companion damped_oscillator.py simulates the underdamped regime and plots alongside the trajectory; the energy curve gives an at-a-glance check that the numerical integration preserves the dissipation structure of the continuous system.
1.4 Coupled systems and the Jacobian
The damped oscillator is a 2-dimensional toy. Real systems — and trained SSMs — live in much higher dimensions, and they typically arise from coupling many simpler subsystems. The bookkeeping is straightforward once you accept that the state vector stacks all subsystem states and encodes both intra-subsystem dynamics and inter-subsystem coupling.
A useful larger example is a ring of coupled oscillators: identical damped oscillators arranged on a circle, each coupled to its two nearest neighbors by springs of stiffness . The displacements and velocities stack into a state vector . The matrix has a block-diagonal structure for the per-oscillator dynamics, plus an off-diagonal coupling pattern that links each oscillator’s velocity equation to its neighbors’ positions:
with indices taken mod . The bracketed term is the discrete Laplacian of around the ring; it’s the same discretization of you would use in a finite-difference method for the wave equation.
The Jacobian matrix — which for a linear system is just — has eigenvalues that are easy to compute analytically because of the ring’s circulant structure. They come in complex-conjugate pairs
(for the underdamped regime). Each indexes a standing wave mode on the ring with wavenumber ; the mode is uniform translation (no spatial structure), (when is even) is the maximally-oscillating mode where adjacent oscillators move in antiphase. The companion coupled_oscillators.py constructs for and visualizes the eigenvalues in the complex plane.
The key qualitative lesson: once you have the right state-space lift, even a system with rich spatial structure reduces to “compute eigenvalues of , classify by real and imaginary parts.” The eigenvalue distribution of a trained SSM’s — pre-training, mid-training, and converged — is one of the most informative diagnostics you can compute (Chapter 2, §2.2; Chapter 5).
The Jacobian beyond linear systems
For a nonlinear system , the Jacobian is the matrix of partial derivatives at . The linearization around a fixed point (where ) describes how small perturbations evolve. The eigenvalues of then determine local stability of by the same classification as in §1.2.
Trained recurrent networks — including SSMs in inference mode — are formally nonlinear (the state update is state = f(state, input) where f includes the input-dependent matrices). The linearization-around-a-state gives a local state-space system at each time step, and tracking how that local varies along a trajectory is the foundation of the Lyapunov-exponent analysis in Chapter 2.
1.5 Phase portraits and the eigenvalue classification
A phase portrait is the geometric picture of all trajectories of an autonomous system in state space, drawn as oriented curves. For 2-dimensional linear systems on , the topology of the portrait is determined entirely by the trace and the determinant — equivalently, by the eigenvalues — and the standard classification gives six qualitative behaviors:
| Region in -plane | Eigenvalues | Phase portrait |
|---|---|---|
| , , | Real negative | Stable node |
| , , | Real positive | Unstable node |
| , , | Complex conjugate, | Stable spiral |
| , , | Complex conjugate, | Unstable spiral |
| , | Pure imaginary | Center (closed orbits) |
| Real, opposite signs | Saddle |
The damped oscillator’s three sub-cases (over/critical/under-damped) realize three of these: stable node, stable degenerate node, stable spiral. The undamped limit slides the eigenvalues onto the imaginary axis, turning the spiral into a center; this is the Hamiltonian limit where energy is conserved exactly.
For higher-dimensional systems the topology is richer (think: 3-D systems can have spirals around line-shaped invariant manifolds), but the building blocks are the same: each pair of complex-conjugate eigenvalues contributes a 2-D spiral subspace, each real eigenvalue contributes a 1-D direction of decay or growth. This decomposition by invariant subspaces — equivalently, by the Jordan blocks of — is the geometric content of the matrix exponential’s eigenvalue structure (§1.2).
1.6 A preview of frequency response
The variation-of-parameters formula
expresses the state as a convolution of the input with the kernel . The output (taking ) is therefore
The function for (and zero otherwise) is the impulse response of the system: the output you would observe if the input were a Dirac delta .
Taking the Laplace transform turns convolution into multiplication. With and similarly for , you get
where is the transfer function. The eigenvalues of — already shown to govern — appear as the poles of , since blows up exactly when equals an eigenvalue of . The poles’ location in the complex -plane mirrors the eigenvalue classification of §1.5: left-half-plane poles ⇔ stable system; right-half-plane poles ⇔ unstable; imaginary-axis poles ⇔ marginally stable / oscillatory.
This connection is the bridge to Chapter 8, where you’ll see that the S4 layer’s convolutional view is literally a numerical evaluation of the impulse response at uniformly spaced time points, and to Chapter 11, where the spectral structure of long convolution kernels (Hyena, RetNet) is analyzed via the same transfer-function framing.
1.7 What’s next
The next two chapters develop the analytical machinery that this chapter has only sketched. Chapter 2 makes the eigenvalue-stability story rigorous (Lyapunov, A-stability, BIBO) and adds the QR-based computational method for tracking eigenvalue trajectories during training. Chapter 3 introduces the structured-matrix vocabulary (Toeplitz, semiseparable, Vandermonde, Cauchy) needed to discuss the kernel constructions of S4 and friends. Chapters 4–6 then take the continuous state-space system here and discretize it — first crudely (ZOH, bilinear), then with full numerical-analysis machinery (Butcher tableau, A-stability), then via implicit and structure-preserving methods (Gauss–Legendre, symplectic integrators), which is where the active research pilot’s anchor (the C1 pilot) lives.
You can also profitably jump ahead: Chapter 8’s LTI SSM presentation reuses everything in this chapter and is self-contained for readers who want to see the SSM application before working through the full math foundation.
1.8 Exercises
Seven problems mixing computation and theory. Short/numerical exercises (1–4) have inline collapsible solutions; long/proof exercises (5–7) have full worked solutions in §1.9.
Exercise 1.1 (computation)
Compute the matrix exponential of by hand, term-by-term, and verify it equals the rotation matrix .
Solution
Compute powers: , so , , and the pattern repeats with period 4. Splitting the series by parity:
which expands to . ∎
Exercise 1.2 (computation)
For the damped oscillator with and , classify the damping regime and find the eigenvalues of .
Solution
Discriminant: , so the system is underdamped. Eigenvalues: . The trajectory spirals toward the origin with envelope decay rate and angular frequency .
Exercise 1.3 (computation)
Lift the third-order ODE into a 3-dimensional state-space system . Write and explicitly.
Solution
Let . Then , and from the ODE , so
The bottom row encodes the ODE’s coefficient signature; the upper-triangular ones implement the bookkeeping . This is the companion form of the ODE.
Exercise 1.4 (computation)
For the ring of identical damped oscillators with , , (no damping, symmetric coupling), use the formula in §1.4 to find the eigenvalues of analytically. Confirm they lie on the imaginary axis (since ) and verify by running companions/ch01/jax/coupled_oscillators.py with .
Solution
With , for . Numerically:
- :
- :
- :
- : same as by symmetry ⇒
All 8 eigenvalues are pure imaginary (centers), as expected for an undamped conservative system. The numerical verification reproduces these (modulo double-precision noise).
Exercise 1.5 (theory) — solution in §1.9
Prove that the matrix-exponential series converges absolutely (entrywise) for every square matrix , with bound where is the Frobenius norm.
Exercise 1.6 (theory) — solution in §1.9
Show that if is skew-symmetric (), then is orthogonal for every (that is, ). Use this to give a geometric reason why the rotation matrix of Exercise 1.1 is orthogonal.
Exercise 1.7 (theory) — solution in §1.9
Prove the variation of parameters formula: the unique solution of with is
1.9 Full solutions to theory exercises
Solution to Exercise 1.5
Let denote the Frobenius norm, which satisfies the sub-multiplicative property . Then by induction. Therefore
The partial sums are entrywise bounded by terms of a convergent scalar series, so each matrix entry is Cauchy in and converges. The limit is what we call . The bound follows by passing to the limit. ∎
Solution to Exercise 1.6
We have and . Let . Then
since by skew-symmetry. So for all , i.e. is orthogonal.
Geometric interpretation: the rotation matrix of Exercise 1.1 was generated by , which is skew-symmetric. So is orthogonal — which we already knew, but this derivation pins down why: orthogonality follows structurally from skew-symmetry of the generator, not from any property of the cos/sin functions. ∎
Solution to Exercise 1.7
Define by the formula on the right-hand side. We verify (a) the initial condition and (b) the ODE.
Initial condition. At the integral vanishes (zero-length interval) and , so . ✓
ODE. Differentiate using the product rule on the integral (Leibniz’s rule for differentiating under the integral sign):
Uniqueness follows from the classical Picard–Lindelöf theorem applied to the right-hand side , which is Lipschitz in uniformly in with constant (see Hairer–Nørsett–Wanner Hairer et al. (1993) for the standard proof). ∎
1.10 Companion code
Three runnable companions in companions/ch01/jax/:
damped_oscillator.py— simulates the underdamped oscillator, plots energy decay (Figure 1.1)coupled_oscillators.py— constructs the ring-Laplacian state matrix, plots eigenvalue spectrum (Figure 1.2)matrix_exponential.py— comparesscipy.linalg.expmagainst truncated series sums; the truncated series catastrophically diverges for matrices with large spectral radius (auxiliary, used by Exercise 1.5)
All companions import their plotting style from companions/_shared/plot_utils.py. To run from the repo root:
PYTHONPATH=. python companions/ch01/jax/damped_oscillator.py
PYTHONPATH=. python companions/ch01/jax/coupled_oscillators.py
PYTHONPATH=. python companions/ch01/jax/matrix_exponential.py
Figures are written to public/figures/ch01/ and referenced from the <Figure> components above.
Stability theory: Lyapunov, A-stability, BIBO
Three distinct notions of stability — Lyapunov (eigenvalue criterion + QR-based computation), A-stability (regions of absolute stability for ODE integrators), and BIBO (bounded-input bounded-output for LTI systems).
Stability theory: Lyapunov, A-stability, BIBO
2.1 Three notions of stability
For the rest of this chapter, the system under analysis is the LTI state-space model from Chapter 1:
with . The three notions:
- Lyapunov stability asks: with , do trajectories stay bounded as ? Asymptotically stable if .
- A-stability is a property of a numerical integrator, not of the system. It asks: does the integrator preserve Lyapunov stability for every stable LTI test problem, at every step size ? If yes, the method is A-stable.
- BIBO stability asks: for the input–output mapping , does every bounded input () produce a bounded output? The system is BIBO-stable if yes.
When has all eigenvalues in the open left half-plane (LHP), all three coincide for the LTI case. But:
- Lyapunov-stable systems can be Lyapunov-stable but not asymptotically stable (centers: pure imaginary eigenvalues, oscillations don’t decay).
- A-stability is a integrator property; the explicit Euler method is not A-stable, while the implicit Euler method is. The same continuous system therefore may need a stable integrator to preserve its stability.
- BIBO depends on the transfer function and on the choice of realization. Two systems with the same can have different BIBO properties if their “hides” unstable modes.
The rest of the chapter develops each notion in turn.
2.2 Lyapunov stability: theory
We restrict to the homogeneous case (zero input). The eigenvalue criterion is essentially the content of Chapter 1, §1.2, formalized:
Let with eigenvalues . The system is:
-
Lyapunov-stable (trajectories bounded) iff every satisfies , and for every on the imaginary axis (), its algebraic multiplicity equals its geometric multiplicity (no defective Jordan blocks).
-
Asymptotically stable (trajectories ) iff every satisfies , i.e. all eigenvalues are in the open left half-plane.
-
Unstable otherwise.
The proof reduces to the matrix-exponential formula. If is diagonalizable, then and the -th eigencomponent decays like . The Jordan-block caveat in case (1) handles the case where a defective imaginary eigenvalue produces polynomial growth in even though the real part is zero.
For nonlinear systems, the local-linearization picture from Chapter 1, §1.4 makes the same theorem applicable at each fixed point: the fixed point is asymptotically stable if the Jacobian’s eigenvalues all sit in the LHP. This is the principle that makes “check the eigenvalues” the first move in nearly every SSM stability analysis.
The Lyapunov equation
There’s an alternative characterization of asymptotic stability that doesn’t require computing eigenvalues. The continuous-time Lyapunov equation
(with a given symmetric positive-definite matrix) has a unique symmetric positive-definite solution if and only if is asymptotically stable. The function is then a Lyapunov function for the system: for all nonzero , so decreases along trajectories and forces them to the origin.
For SSM analysis, the Lyapunov-equation view is occasionally useful — for example, the controllability and observability Gramians of an LTI realization are Lyapunov solutions Antoulas (2005) — but the eigenvalue criterion is the more direct tool.
2.3 Lyapunov exponents: computation
The eigenvalues of work as a stability diagnostic for LTI systems, where is constant. For systems where the linearization varies with time (e.g. a trained recurrent network in inference mode, where the per-step Jacobian depends on the input), eigenvalues at any single don’t tell the long-run story. The right tool is the Lyapunov exponent, which generalizes the eigenvalue’s real part to time-varying systems.
Given a sequence of transition matrices (the Jacobians of an iterated map at points along a trajectory), let . The -th Lyapunov exponent is
where are the singular values of . The vector is the Lyapunov spectrum.
Three things make Lyapunov exponents the right object:
- They generalize eigenvalues to time-varying systems. For a constant Jacobian , where are the eigenvalues’ magnitudes (in the discrete case) or (in the continuous case). The connection back to Chapter 1’s eigenvalue classification is exact in the autonomous case.
- The maximal Lyapunov exponent measures sensitive dependence. A positive means infinitesimally close initial conditions diverge exponentially (chaos); is the edge-of-chaos regime; means contracting dynamics.
- They are robust to coordinate changes. Eigenvalues of the per-step Jacobian depend on the basis you write the state in; Lyapunov exponents do not.
The QR algorithm
Computing for large directly is numerically impossible: even for a well-behaved system with , after steps the largest singular value is and the smallest may be . The product matrix becomes either numerical zero or numerical infinity in finite-precision arithmetic, and finite-precision singular values lose all meaning. The fix, due to Benettin et al. Benettin et al. (1980) , is to factor the growth out at every step using QR decomposition.
The QR-based Lyapunov algorithm (the version that has become standard in dynamical-systems software):
Initialize Q_0 = I (identity), running sums L_i = 0 for i = 1..N.
For t = 1, 2, ..., T:
M_t = J_t @ Q_{t-1} # one-step forward propagation
Q_t, R_t = qr(M_t) # extract orthogonal + upper triangular
Update L_i += log |R_t[i,i]| for each i
At the end:
Lyapunov_i = L_i / T
The key insight: by re-orthonormalizing the propagated frame at every step, we factor the exponential growth into the diagonal of the upper-triangular rather than letting it accumulate in . The diagonal entries of are well-conditioned numbers near 1, and their logarithms sum nicely to give the Lyapunov exponents.
The companion lyapunov_qr.py implements this algorithm and applies it to the ring of coupled oscillators from Chapter 1, §1.4. The resulting spectrum’s structure — all exponents negative for the damped case, all near zero for the undamped case — matches the eigenvalue analysis exactly, validating the implementation.
2.4 A-stability: integrator regions of absolute stability
Switch perspectives. We now have a continuous system and we discretize it with some numerical integrator at step size . The integrator approximates the true continuous trajectory by a discrete recurrence
where is the integrator’s stability function, evaluated at the matrix argument via the matrix-functional-calculus convention. Different integrators give different :
- Forward (explicit) Euler: , so .
- Backward (implicit) Euler: , giving .
- Bilinear (Tustin / trapezoidal): .
- Exact (zero-order hold): . This is the integrator S4 uses by default.
The discrete recurrence is Lyapunov-stable if and only if every eigenvalue of satisfies (the discrete-time analog of the LHP criterion is the unit disk). Since the eigenvalues of are where are the eigenvalues of , the question reduces to: for which in the complex plane is ?
The region of absolute stability of a numerical integrator with stability function is
The integrator is A-stable if contains the entire closed left half-plane.
A-stability matters because it means: if the continuous system is asymptotically stable, the integrator preserves that stability at every step size . Without A-stability, you have to pick small enough that lands inside the (bounded) stability region.
The four examples above:
- Forward Euler: ⇔ in the closed disk of radius 1 around . Tiny region; not A-stable. For a continuous system with eigenvalue (fast decay), forward Euler requires to avoid blowup.
- Backward Euler: ⇔ ⇔ outside the open disk of radius 1 around . This contains the entire closed LHP. A-stable.
- Bilinear (trapezoidal): ⇔ exactly. The stability region is exactly the closed LHP. A-stable, and the boundary coincides with the imaginary axis — a property called L-stability when combined with as , which bilinear does not satisfy ().
- Zero-order hold (matrix exponential): ⇔ . The stability region is the closed LHP. A-stable and L-stable. This is the gold standard for LTI discretization — and it’s what S4 uses internally.
Why this matters for SSMs
The S4 family uses ZOH discretization, and is therefore stable at every step size for which the continuous system is stable. Mamba-1 uses ZOH as well. Mamba-3’s exponential-trapezoidal rule (Chapter 10) is a second-order generalization of bilinear that retains A-stability. The C1 research pilot (see Chapter 6) asks whether symplectic integrators — A-stable methods designed to preserve geometric invariants — can do better for complex-state SSMs whose has near-imaginary eigenvalues.
When a stability-region analysis fails, it usually fails dramatically: a forward-Euler discretization of a stiff system blows up in a handful of steps. This is exactly the failure mode that careful integrator choice prevents.
For full coverage of A-stability theory and a wider zoo of integrator stability functions, Hairer–Wanner’s stiff-systems volume Hairer & Wanner (1996) is the canonical reference.
2.5 BIBO stability
The third notion looks at the input–output mapping rather than the autonomous dynamics. For the system , , the output under a unit-impulse input is the impulse response
The system is BIBO-stable if its impulse response satisfies , equivalently if . Under this condition, any bounded input produces a bounded output — Young’s inequality on convolutions gives the precise bound .
A causal LTI system is BIBO-stable iff all poles of its transfer function lie in the open left half-plane.
Poles of are the values of where — i.e. the eigenvalues of — unless a pole–zero cancellation occurs because doesn’t “see” some eigenmode. This cancellation matters: a system can be internally unstable (some in the right half-plane) but BIBO-stable if those unstable modes are unobservable from and unreachable from . The Kalman canonical-decomposition theorem makes this precise; for our purposes, the eigenvalues of in the LHP are sufficient for both internal and BIBO stability.
For an SSM viewed as a sequence-to-sequence map, BIBO is the natural input–output stability concept: bounded inputs (typical of token embeddings) should produce bounded outputs (typical of pre-softmax logits). When training an SSM produces an whose eigenvalues drift into the right half-plane, both internal (Lyapunov) and input–output (BIBO) stability fail simultaneously, and the model’s outputs blow up. Tracking the eigenvalues of during training is therefore a single diagnostic for both failure modes.
2.6 Connection to Lyapunov functions
The Lyapunov-function method gives a non-spectral way to prove stability. For nonlinear systems where eigenvalues don’t directly apply, finding a function with , for , and for all proves asymptotic stability of .
For our LTI case , the Lyapunov equation (with p.d.) gives as a quadratic Lyapunov function — and its existence with p.d. is equivalent to asymptotic stability (as noted in §2.2). The interest in the function-method picture for our context is mainly conceptual: it explains why “energy” arguments work to prove stability of mechanical systems (the energy is the Lyapunov function), and it generalizes immediately to nonlinear systems where the eigenvalue criterion only gives local information at fixed points.
This connects forward to Chapter 6, where the question of which discretization preserves a Lyapunov function (or, more strongly, an energy invariant) leads naturally into symplectic methods.
2.7 What’s next
Chapter 3 develops the structured-linear-algebra vocabulary (SVD, Jordan form, condition number, Toeplitz/Vandermonde/Cauchy/semiseparable structure) needed for the SSM kernel constructions in Chapters 7–9. Chapter 4 takes the discretization story started in §2.4 and develops it systematically: order conditions, accuracy classes, the Butcher tableau, the bilinear and ZOH derivations in detail. Chapter 6 picks up the A-stability theme and pushes into implicit and structure-preserving integrators — symplectic schemes, geometric integration, and the C1 pilot’s home territory.
2.8 Exercises
Six problems. Inline-collapsible solutions for the shorter ones; full solutions for the longer theory problems in §2.9.
Exercise 2.1 (computation)
Use the eigenvalue criterion (Theorem 2.1) to classify the stability of for .
Solution
Characteristic polynomial: . Roots: . Both have , so the system is asymptotically stable. The non-zero imaginary part means trajectories spiral toward the origin (stable spiral).
Exercise 2.2 (computation)
For the matrix of Exercise 1.1, find the eigenvalues and classify the Lyapunov stability. Is the system asymptotically stable?
Solution
Eigenvalues: . Both have , so the system is Lyapunov-stable but not asymptotically stable — trajectories are bounded (centers / closed orbits) but do not decay to zero. This is the undamped harmonic oscillator’s marginal case.
Exercise 2.3 (computation)
The forward Euler method has stability function . For the continuous test problem with (a stiff system with fast time scale), find the maximum step size for which forward Euler is stable.
Solution
Stability condition: with . Algebra: ⇔ ⇔ . So forward Euler requires . Any produces exponentially growing iterates even though the continuous system decays. This is the classic motivation for implicit methods on stiff problems.
Exercise 2.4 (computation)
Show that the bilinear method maps the closed left half-plane onto the closed unit disk . (This is the Möbius map picture of the bilinear transform.)
Solution
Let with . Compute:
The denominator minus the numerator equals (since ). So numerator denominator and . Equality iff (purely imaginary ), so the imaginary axis maps to the unit circle and the open LHP maps to the open unit disk. ∎
Exercise 2.5 (theory) — solution in §2.9
State and prove the Lyapunov equation characterization of asymptotic stability: is asymptotically stable iff for every symmetric positive-definite , the equation has a unique symmetric positive-definite solution .
Exercise 2.6 (theory) — solution in §2.9
Prove the BIBO criterion of Theorem 2.3: a causal LTI system is BIBO-stable iff all poles of its transfer function lie in the open left half-plane. (Hint: Use the Laplace-transform / convolution duality, and exploit the representation of the impulse response as a sum of exponentials.)
2.9 Full solutions to theory exercises
Solution to Exercise 2.5
Forward direction (Lyapunov equation ⇒ asymptotic stability): Suppose is symmetric positive-definite and satisfies for some symmetric positive-definite . Define . Then for (since ), and along trajectories :
So is a strict Lyapunov function and trajectories satisfy for determined by the smallest eigenvalue of over the largest of . Hence and the origin is asymptotically stable.
Reverse direction (asymptotic stability ⇒ Lyapunov equation has p.d. solution): Suppose is asymptotically stable. Define
The integral converges absolutely because decays exponentially. is symmetric (transpose preserves the integrand structure) and positive-definite (since and is invertible). Direct computation:
using the decay of at . Uniqueness of follows from the linearity of the Lyapunov operator and the fact that its kernel is trivial when has no purely imaginary eigenvalues. ∎
Solution to Exercise 2.6
(⇒) BIBO-stable implies poles in open LHP. Suppose the system is BIBO-stable, i.e. . The Laplace transform then exists and is analytic for (since the integrand is bounded by which is integrable, the integral converges uniformly). is rational (the matrix-inverse formula gives , a ratio of polynomials with denominator ). A rational function analytic on the closed right half-plane has no poles there — so all poles of lie in the open LHP.
(⇐) Poles in open LHP implies BIBO-stable. Suppose all poles of lie in the open LHP. Partial-fraction decomposition gives where are the poles with multiplicity , and for all . The inverse Laplace transform of each term is , which is on because the exponential decay beats any polynomial growth in . Sum of functions is , so and BIBO holds.
(For the multi-input multi-output case, apply the same argument entrywise to .) ∎
2.10 Companion code
Two runnable companions in companions/ch02/jax/:
lyapunov_qr.py— implements the QR-based Lyapunov-exponent algorithm and applies it to the ring-of-oscillators system from Chapter 1stability_regions.py— plots the regions of absolute stability for forward Euler, backward Euler, bilinear (trapezoidal), and ZOH in the complex plane
To run from the repo root:
PYTHONPATH=. python companions/ch02/jax/lyapunov_qr.py
PYTHONPATH=. python companions/ch02/jax/stability_regions.py
Figures land in public/figures/ch02/ (referenced from §2.3 and §2.4 above).
Linear algebra for sequence models: structured matrices and conditioning
Eigenvalue and SVD decompositions, condition number, the four structured-matrix families (Toeplitz, Vandermonde, Cauchy, semiseparable), low-rank updates, and a Krylov-subspace primer — the linear-algebra vocabulary later chapters assume.
Linear algebra for sequence models: structured matrices and conditioning
3.1 Eigenvalue decomposition and Jordan normal form
Chapter 1’s matrix exponential built directly on the spectral structure of . The fundamental theorem is:
For every matrix , there exist matrices (invertible) and (block-diagonal with Jordan blocks on the diagonal) such that
is the Jordan normal form of ; the diagonal entries of are the eigenvalues of (counted with algebraic multiplicity); the size of each Jordan block equals the difference between the algebraic and geometric multiplicities. If every eigenvalue has equal algebraic and geometric multiplicity, all Jordan blocks have size 1 and is diagonal — is diagonalizable.
A Jordan block of size for eigenvalue looks like
i.e. eigenvalue on the diagonal, ones on the superdiagonal, zeros elsewhere. The matrix exponential of a Jordan block is
which is the source of the polynomial-in- factors mentioned in Chapter 1, §1.2 — they appear exactly when there are Jordan blocks of size .
For SSM analysis, the diagonalizable case dominates. Almost every arising from random initialization or from training is diagonalizable; the non-diagonalizable case is a codimension-one subset of parameter space and shows up only at carefully-tuned boundaries. In practice the assumption ” is diagonalizable” is essentially always valid; the Jordan form is the worst-case fallback.
3.2 Singular value decomposition
The eigenvalue decomposition requires to be square; for general (rectangular or possibly non-diagonalizable) matrices, the right tool is the singular value decomposition.
For every matrix , there exist orthogonal matrices , , and a diagonal-with-zeros matrix with non-negative entries on the diagonal, such that
The are the singular values of ; they are uniquely determined. The number of nonzero equals .
The SVD has three properties that make it the workhorse of numerical linear algebra:
- It always exists. Unlike the eigenvalue decomposition, no diagonalizability or even squareness assumption is needed.
- The singular values give the Frobenius and operator norms. (operator norm, equal to the largest singular value); .
- It reveals low-rank structure cleanly. Truncating the SVD at rank — keeping only the top singular values and corresponding columns of — gives the best rank- approximation to in both Frobenius and operator norms (the Eckart–Young theorem).
For square , the singular values are related to but distinct from the eigenvalues. The relationship (using descending order on both sides) lets you compute singular values via an eigenvalue problem on the Gram matrix — though in practice the QR-iteration-based SVD algorithm (LAPACK’s gesdd) is more numerically stable.
The SVD shows up explicitly in:
- Chapter 2’s Lyapunov-exponent computation (singular values of the propagated frame matrix).
- Chapter 8’s HiPPO matrix analysis (the conditioning of the projection operator is given by its SVD).
- Chapter 12’s delta-rule lineage (DeltaNet’s state update is a rank-1 SVD correction).
For a textbook treatment of the SVD’s properties and algorithms, Trefethen–Bau Trefethen & Bau (1997) Chapters 4–5 are the standard reference. Golub–Van Loan Golub & Van Loan (2013) covers the algorithmic details exhaustively.
3.3 Condition number
The condition number of a matrix (with respect to the operator norm) is
the ratio of the largest to smallest singular value (when is invertible; otherwise ). The condition number measures how much amplifies relative input perturbations into relative output perturbations: if and has relative error , then the relative error in the computed solution can be as large as .
The qualitative scale: is the orthogonal case (no amplification); means losing roughly decimal digits of precision when solving a linear system. Double-precision floating point has about 16 decimal digits; matrices with are numerically singular.
For SSMs, condition number matters in three places:
- HiPPO matrix construction. The HiPPO-LegS matrix has structure that keeps its condition number bounded as grows — this is why HiPPO is the standard initialization. Random initializations typically give growing as or worse.
- S4 kernel computation. S4’s Vandermonde-Cauchy kernel construction requires inverting a matrix with structured but potentially ill-conditioned columns. The paper carefully handles the conditioning; naive implementations don’t.
- Mamba-3’s complex-state design. Chapter 10 discusses how Mamba-3 deliberately places eigenvalues in a region of the complex plane where the discrete-time map remains well-conditioned across the integration step.
3.4 Structured matrix families
Four classes of structured matrices appear throughout the SSM literature. Each is parameterized by numbers rather than the a general matrix needs, and each admits fast matrix-vector products via specialized algorithms.
Toeplitz matrices
A Toeplitz matrix is constant along each diagonal:
Parameterized by values . Toeplitz matrices are the matrix form of convolutions: if is Toeplitz with first column and first row (lower-triangular Toeplitz), then is the discrete convolution of the kernel with . The FFT-based convolution algorithm computes in time.
LTI SSM convolutional view (Chapter 8) is exactly this: the operator that maps the input sequence to the output sequence via is a Toeplitz matrix at the discretization level, with first column equal to the discretized impulse response .
Vandermonde matrices
A Vandermonde matrix has the form
parameterized by values . The defining property is that is a polynomial in evaluated at the node . So Vandermonde matrices implement polynomial evaluation at points as a linear map.
Vandermonde matrices are notoriously ill-conditioned for nodes on the real line (condition number can grow exponentially), but well-behaved for nodes on the unit circle — which is why the FFT (Vandermonde with being roots of unity) is numerically stable.
The S4 kernel computation uses Vandermonde structure: evaluating over is exactly the Vandermonde-style polynomial evaluation. The S4 paper Gu et al. (2022) uses Cauchy-matrix tricks (next subsection) to make this evaluation stable.
Cauchy matrices
A Cauchy matrix has entries
parameterized by values with the empty (to avoid zero denominators). Cauchy matrices are dense — every entry depends on both indices — but the very structured dependence enables fast algorithms: a Cauchy matrix-vector product can be computed in time using the fast multipole method.
Cauchy matrices appear in two places in the SSM literature: the S4 paper uses them as a numerically stable replacement for direct Vandermonde-style kernel evaluation, and the diagonal-plus-low-rank parametrization of in S4 has a Cauchy-matrix interpretation when viewed through the partial-fraction decomposition of its transfer function.
Semiseparable matrices
A rank- semiseparable matrix has the property that every submatrix lying strictly above the main diagonal (and every submatrix lying strictly below) has rank at most . Equivalently, the upper and lower triangular parts each have rank- structure.
The 1-semiseparable case — every off-diagonal block has rank at most 1 — is the structure exploited by Mamba-2’s SSD framework Dao & Gu (2024) . The 1-semiseparable lower-triangular matrix corresponding to a scalar-times-identity SSM is, explicitly,
where is the (scalar) recurrence coefficient at step . The entry for is the product . The SSD insight is that this matrix-vector product can be computed two ways: as a scan (the SSM view, time) or as a structured matrix multiply (the attention view, time but matmul-friendly on GPUs).
When is moderate (say ) and the GPU favors matmul over scan, the matrix view wins; when is huge, the scan view wins. Mamba-2’s chunked algorithm interpolates: matrix-multiplies within chunks of size , scans across chunks. This is the structural reason Mamba-2 is faster than Mamba-1 on long sequences without sacrificing parallelism.
3.5 Low-rank corrections and rank-1 updates
A recurring pattern: take a structured matrix (diagonal, banded, or semiseparable) and add a small low-rank correction where with . The resulting matrix retains nearly the storage and matvec efficiency of , and admits a closed-form inverse via the Sherman–Morrison–Woodbury identity:
The cost is dominated by inverting the matrix , not the outer matrix.
This pattern appears in S4 explicitly: the HiPPO-LegS matrix isn’t diagonal, but it is “diagonal plus low-rank” — specifically, normal plus low-rank, which is what makes the kernel computation tractable. The S4 paper’s main algorithmic contribution is a specialized Sherman–Morrison-style trick for this exact decomposition.
The pattern recurs even more visibly in DeltaNet (Chapter 12), where the state update is
The middle factor is a rank-1 correction (specifically, an identity minus a rank-1 outer product), and the whole expression is one explicit-Euler step of an online gradient-descent update on the per-token association loss. The Longhorn paper Dao & Gu (2024) (well, the linear-attention lineage more generally) frames this as a discretization of an implicit-rank-1-correction online ODE.
The key takeaway: structured + low-rank corrections give you efficient matrix-vector products without giving up expressiveness. This is the design pattern unifying S4, DeltaNet, GLA, and the Mamba-2 SSD framework — each is a different choice of base structure and correction pattern.
3.6 Krylov subspaces: a primer
The Krylov subspace of order generated by a matrix and a vector is
Krylov subspaces are the workhorse of iterative methods for large sparse linear systems: GMRES, conjugate gradient, Arnoldi iteration, Lanczos. All of them construct an orthonormal basis for and solve a small () projected problem instead of the original one.
For SSMs, the Krylov picture is conceptual rather than algorithmic. The point is that contains all the information the recurrence can extract from the initial condition in steps. If the system has eigenvalues that are decoupled from the initial direction (the so-called “unreachable subspace”), the recurrence cannot recover them, and the effective dimension of the SSM is . This is one rigorous reading of the Mamba copying-limitation results — the recurrence’s expressive ceiling is set by the Krylov dimension of relative to the input.
A full treatment of Krylov methods would fill its own chapter; the curriculum revisits the picture in Chapter 8 (where the S4 kernel can be viewed as a structured Krylov-projection problem) and in Chapter 16’s empirical-methodology discussion of why some architectures can copy strings exponentially longer than others.
For a textbook coverage of Krylov-subspace methods, Trefethen–Bau Trefethen & Bau (1997) Chapters 32–40 give the full algorithmic treatment.
3.7 What’s next
You now have the linear-algebra vocabulary the rest of the book assumes. Chapter 4 picks up the discretization thread (started in Chapter 2, §2.4) and develops it systematically: order conditions, accuracy classes, the Butcher tableau, the bilinear and ZOH derivations in detail. Chapter 7 introduces the HiPPO theory that connects orthogonal-basis approximation theory to the matrix structure of S4. Chapter 8 then shows how all four structured-matrix families combine in the S4 / S4D / S5 family.
If you’re impatient, Chapter 9’s Mamba-1/2/3 presentation is the payoff for the SSD discussion of §3.4 — the selective scan’s matmul-friendly chunkwise algorithm is exactly the 1-semiseparable matrix product algorithm.
3.8 Exercises
Six problems. Inline solutions for the shorter ones; full proofs for the theory exercises in §3.9.
Exercise 3.1 (computation)
Compute the Jordan normal form of .
Solution
The matrix is already in Jordan form: a single Jordan block with eigenvalue . There is one eigenvalue (algebraic multiplicity 2) but only one linearly independent eigenvector (geometric multiplicity 1), giving a defective . The transformation matrix is the identity ( directly).
Exercise 3.2 (computation)
Compute the singular values of and its operator norm.
Solution
The matrix is already in SVD form (with and ). Singular values are , . Operator norm: . (Note that swapping the diagonal entries doesn’t change the SVD; singular values are always sorted descending.)
Exercise 3.3 (computation)
For the Vandermonde matrix with nodes , write out the matrix and compute its determinant. Compare to the closed-form Vandermonde determinant .
Solution
Direct computation: .
Closed form: . ✓
Exercise 3.4 (computation)
Verify the Sherman–Morrison identity numerically: pick a random invertible matrix , vectors , and check that matches the closed-form expression to machine precision.
Solution
import numpy as np
rng = np.random.default_rng(0)
S = rng.standard_normal((3, 3))
u = rng.standard_normal(3)
v = rng.standard_normal(3)
direct = np.linalg.inv(S + np.outer(u, v))
S_inv = np.linalg.inv(S)
factor = 1.0 + v @ S_inv @ u
sherman = S_inv - np.outer(S_inv @ u, v @ S_inv) / factor
print(np.allclose(direct, sherman)) # TrueThe factor must be non-zero (the Sherman–Morrison identity fails when it is, indicating that is singular).
Exercise 3.5 (theory) — solution in §3.9
Prove the Eckart–Young theorem: the best rank- approximation to a matrix in the Frobenius norm is obtained by truncating its SVD at rank . That is, if with singular values , then minimizes over all matrices of rank .
Exercise 3.6 (theory) — solution in §3.9
Prove that a Toeplitz matrix with for all (operator-norm-bounded ) admits a matrix-vector product in time via the FFT. Specifically, show that any Toeplitz can be embedded in a circulant matrix, whose matvec is exactly the FFT-IFFT pair applied to the kernel and input.
3.9 Full solutions to theory exercises
Solution to Exercise 3.5
The proof uses two ingredients: (a) the SVD’s unitary invariance of the Frobenius norm, and (b) the optimality of truncated diagonal matrices.
Setup. Let be the SVD with and . For any of rank , write where also has rank (rank is invariant under invertible transformations). The Frobenius norm is invariant under orthogonal transformations:
So the problem reduces to: minimize over rank- matrices .
Diagonal reduction. Write with its own SVD where . Then
The middle trace term by the von-Neumann trace inequality (with equality when ‘s singular vectors align with ‘s — i.e. is diagonal in the same basis as ). Optimizing to maximize this term subject to rank gives for and for , i.e. .
Substituting back, the minimum value is , achieved by the truncated SVD . ∎
(The proof for the operator norm follows the same structure but uses Weyl’s interlacing inequality instead of von-Neumann’s trace inequality; see Golub–Van Loan Golub & Van Loan (2013) for the detailed argument.)
Solution to Exercise 3.6
Let be Toeplitz with entries for some kernel .
Step 1 — Circulant embedding. Define a circulant matrix with first column
The first rows and first columns of exactly reproduce (the zero in the middle of is the “buffer” that prevents wrap-around contamination).
Step 2 — Padded matvec. Given , form (zero-padded). Then : the first entries of the circulant product are exactly the Toeplitz product, because the zero-padding ensures the wrap-around portion of the convolution doesn’t contaminate the top half.
Step 3 — Circulant matvec via FFT. Every circulant matrix is diagonalized by the discrete Fourier transform: , where is the -point DFT matrix. So
where is element-wise product. Both FFTs and the IFFT take time; the element-wise product is .
Total cost. for the FFTs + for the element-wise multiply + to extract the first entries = . ∎
This is the exact algorithm used to compute LTI SSM convolutions in S4-family implementations: precompute the kernel once, then convolve with any input in time.
3.10 Companion code
Two runnable companions in companions/ch03/jax/:
condition_number.py— plots κ(A) growth for random Gaussian, Hilbert, and HiPPO-LegS matrices as N grows; produces Figure 3.1structured_matrices.py— constructs Toeplitz / Vandermonde / Cauchy / 1-semiseparable matrices for N=8 and visualizes structural patterns; produces Figure 3.2
To run from the repo root:
PYTHONPATH=. python companions/ch03/jax/condition_number.py
PYTHONPATH=. python companions/ch03/jax/structured_matrices.py
Figures land in public/figures/ch03/.
Discretization theory: ZOH, bilinear, exponential families
One-step discretizations of linear ODEs — zero-order hold, bilinear (Tustin), and exponential-trapezoidal — with the Lax equivalence theorem as the unifying convergence criterion.
Discretization theory: ZOH, bilinear, exponential families
4.1 The discretization problem
Chapter 1 wrote the linear state-space system as a continuous-time ODE
defined for every . Computers do not handle “every ”; they handle a finite grid for with step size . A discretization is a rule for building a recurrence
whose iterates approximate the continuous trajectory at the grid points, . The two discrete matrices are functions of the continuous matrices and of the step size . Different functions give different discretizations.
There is no single canonical choice. Every discrete recurrence in this book — the S4 layer, the Mamba selective scan, the exp-trapezoidal scheme of Mamba-3 — picks a specific rule for translating . The choice matters: a poor discretization can take a perfectly stable continuous system and produce a discrete recurrence that diverges, fails to converge as , or accumulates phase error on long horizons. Numerical-analysis textbooks Hairer et al. (1993) spend hundreds of pages on these failure modes. We will compress them into three properties and three schemes.
A useful warm-up is forward Euler, the simplest possible scheme:
This is what you get by replacing with the forward difference and freezing the input at . The discrete matrices are and . Forward Euler is the cleanest illustration of what can go wrong: it is only conditionally stable, meaning that for any fixed with there is a maximum step size beyond which acquires an eigenvalue of modulus and the recurrence blows up. The three schemes in §4.3–§4.5 fix this defect — each in its own way, with its own trade-off.
4.2 The Lax equivalence theorem
To compare schemes you need a definition of “correct.” The standard one comes from the Lax equivalence theorem Lax & Richtmyer (1956) , which decomposes correctness into two ingredients: consistency and stability.
A one-step scheme is consistent with the ODE if its local truncation error — the residual you obtain by plugging the exact continuous trajectory into the discrete recurrence — vanishes faster than as . Formally,
If the scheme is -th order consistent. Higher means the per-step error shrinks faster with .
A scheme is zero-stable if small perturbations injected at one step propagate boundedly to all future steps. For linear schemes this reduces to a uniform bound on the powers of : there exists independent of such that for all with , for any fixed horizon . Equivalently, the spectral radius .
The two ingredients combine.
(Lax equivalence.) For a linear, well-posed initial-value problem, a one-step scheme is convergent — meaning as — if and only if it is both consistent and zero-stable.
Convergence is what you actually want: that the discrete iterates approach the true continuous trajectory as you refine the grid. The theorem says you cannot get there by being consistent alone (you also need stability) or by being stable alone (you also need consistency). Both are necessary, and together they are sufficient. This is the convergence criterion every scheme in this chapter must pass.
For a stable LTI test problem — meaning has all eigenvalues in the open left half-plane — there is a stronger notion sitting on top of zero-stability: a scheme is A-stable if for every step size , not just for sufficiently small . Forward Euler is not A-stable; ZOH, bilinear, and exp-trapezoidal are. The geometry of A-stability — the set of values for which the scheme behaves well — is the subject of Chapter 5.
4.3 Zero-order hold (ZOH)
The first scheme assumes the input is piecewise constant between samples: for . Under that assumption the inhomogeneous ODE solves exactly on a single interval. Starting from ,
Identifying the two matrices,
This is the zero-order hold (ZOH) discretization. The hold refers to the input assumption; “zero-order” refers to the polynomial degree of the held signal (a constant is a zero-degree polynomial).
A practical wrinkle: writing as requires to be invertible. The HiPPO matrix of Chapter 7 is invertible; some structured in later chapters are not. The augmented matrix exponential trick handles both cases uniformly. Build the block matrix
where is the input dimension. Then
and a single expm call extracts both (top-left block) and (top-right block) without ever inverting . The S4 reference implementation uses this trick; so does the JAX companion discretization_comparison.py.
(ZOH preserves Lyapunov stability.) If every eigenvalue of has and , then every eigenvalue of satisfies . ZOH is therefore A-stable.
The proof is one line: when and . Stability is preserved for every positive step size, which is what A-stability means.
Two further properties are worth highlighting. First, for the autonomous ODE (), ZOH is exact: the recurrence matches without any error at all. This is unusual — most discretizations have nonzero error even on autonomous systems. Second, for forced systems with smooth , ZOH is first-order accurate: the per-step error in the forcing integral is (because the integrand is over an interval of length ), and after steps the cumulative error is . The empirical convergence rate plotted in the companion log-log error figure confirms slope .
4.4 The bilinear (Tustin) transform
The second scheme drops the piecewise-constant assumption and instead approximates the integral on by the trapezoidal rule: . Applied to the inhomogeneous ODE this gives an implicit recurrence,
which one solves for :
Reading off and ,
This is the bilinear transform, also called the Tustin transform after Arnold Tustin’s 1947 paper introducing it in control theory Tustin (1947) . The S4D paper uses bilinear discretization; so do many signal-processing toolboxes.
The bilinear formula has a striking geometric interpretation. Restricted to a single eigenvalue of (commuting through the simultaneous diagonalization), the map that sends a continuous eigenvalue to its discrete image is
This is a Möbius transformation (a.k.a. fractional linear transformation) of the complex plane: with . Möbius transformations are conformal — angle-preserving — and they map circles-or-lines to circles-or-lines. The specific Möbius map above sends the imaginary axis exactly to the unit circle and sends the open left half-plane exactly to the open unit disk.
(Bilinear preserves Lyapunov stability.) If every eigenvalue of has and , then every eigenvalue of satisfies . Bilinear is therefore A-stable.
The proof is geometric: has , and iff (square both sides and cancel). Algebraically, , which is exactly . The full proof, with the Möbius geometry spelled out, is Exercise 4.5 in §4.9.
Bilinear is second-order accurate for smooth forcing — the trapezoidal-rule local truncation error is , giving global error . Unlike ZOH, it is not exact even on autonomous systems: the Padé approximation of agrees with the true exponential only through second order. The trade is one of structure: bilinear maps the imaginary axis exactly to the unit circle, so marginally stable continuous modes (purely imaginary eigenvalues) become marginally stable discrete modes (eigenvalues exactly on the unit circle). ZOH crushes them to the unit-disk interior at rate (marginal in the limit but never exactly on the circle for finite ). This makes bilinear the natural choice when the system has oscillatory modes you want to preserve — a property that becomes load-bearing for the symplectic methods of Chapter 6.
4.5 Exponential-family discretizations
The third family takes a different approach: rather than approximating by a low-order Padé form (as bilinear does) or by the identity-plus-linear truncation (as forward Euler does), it computes exactly and approximates only the forcing integral. This is the philosophy of exponential integrators Hochbruck & Ostermann (2010) .
The exact one-step formula from variation of parameters (Chapter 1, §1.6) is
ZOH approximates on ; bilinear approximates the integral by the trapezoidal rule with a Padé-approximated exponential. The exponential-trapezoidal scheme keeps the exact exponential and uses a linear interpolation of the input: . Substituting and evaluating the resulting two integrals gives
where and are the first two -functions of the exponential family:
Both are entire functions (the apparent singularities at are removable: , ). Applied to the matrix , they produce matrix-valued objects that can be computed via the same augmented-matrix-exponential trick as ZOH.
Reading off the discrete matrices in the form :
Notice that is the same as ZOH, so exp-trapezoidal preserves stability identically to ZOH: A-stable, exact for autonomous systems, eigenvalues in the unit disk for any . The improvement is in the order of accuracy: by interpolating linearly rather than holding it constant, exp-trapezoidal becomes second-order accurate (provided is ).
Exp-trapezoidal is the discretization of choice when the continuous dynamics are stiff — when has eigenvalues with very different magnitudes, so the linear part is the hard part of the problem. By treating exactly the scheme sidesteps the step-size restriction that explicit methods inherit from the stiff eigenvalues, and the matrix exponential is computed once per layer in practice. Chapter 10 returns to this story for Mamba-3, which adopts exp-trapezoidal precisely because the input-dependent in selective SSMs produces stiff dynamics that ZOH and bilinear handle poorly.
A subtle implementation point: is not numerically equal to when this formula is computed naively, because catastrophic cancellation in the numerator destroys precision for small . The standard remedy is to compute all -functions simultaneously from one augmented matrix exponential, again using the trick that produced for ZOH. Both companions (exp_trapezoidal.py in JAX and discretization_atlas.jl in Julia) implement the augmented form.
4.6 Order of accuracy and the discretization hierarchy
The three schemes are summarized in the table below. “Autonomous-exact” means the scheme produces zero error for the homogeneous problem (). “A-stable” means for every on every stable LTI test problem.
| Scheme | Order | Autonomous-exact | A-stable | |
|---|---|---|---|---|
| Forward Euler | 1 | No | No | |
| ZOH | 1 | Yes | Yes | |
| Bilinear (Tustin) | 2 | No | Yes | |
| Exp-trapezoidal | 2 | Yes | Yes |
The empirical order can be read directly off a log-log plot of error against step size: a -th-order scheme has error , so on a log-log plot the data lie on a line of slope . The Julia companion discretization_atlas.jl runs this sweep against a high-accuracy Tsit5 reference solution from DifferentialEquations.jl; the JAX companion discretization_comparison.py performs the same sweep using scipy.integrate.solve_ivp (Radau) as the reference.
The hierarchy is not strict: ZOH’s autonomous-exactness can outweigh its first-order accuracy when the forcing is mild relative to the homogeneous decay. Bilinear’s exact imaginary-axis preservation can outweigh its non-exactness on autonomous problems when the system has long-lived oscillations. Exp-trapezoidal combines the best of both — exact on autonomous, second-order on forced — at the cost of computing more -functions per step. The trade-offs are recurring themes throughout the SSM literature: S4 uses ZOH for its simplicity and exactness on the (autonomous) HiPPO drift; S4D and S5 follow suit; Mamba-3 switches to exp-trapezoidal once the input-dependence makes the dynamics stiff enough that the second-order error compounding matters.
4.7 What’s next
This chapter introduced three schemes and proved each preserves Lyapunov stability. Chapter 5 zooms in on the stability region of each scheme — the set of values for which — and develops the Butcher-tableau machinery for systematically constructing higher-order Runge–Kutta methods. Chapter 6 then asks what to do when even A-stability is not enough — when the system is stiff or Hamiltonian and you need L-stable implicit methods, or symplectic methods that preserve geometric invariants. The exp-trapezoidal scheme of §4.5 turns out to be a member of a much larger family of exponential integrators, all of which deserve a place in your numerical-analysis toolkit.
Forward-looking SSM connections: ZOH is the default in Chapters 7–9 (HiPPO, S4, S4D, S5, Mamba-1/2). Bilinear shows up in S4D when the HiPPO matrix is diagonalized. Exp-trapezoidal is the Mamba-3 scheme, treated in Chapter 10.
4.8 Exercises
Six problems mixing computation and theory. Short/numerical (4.1–4.3) have inline collapsible solutions; long/proof exercises (4.4–4.6) have full worked solutions in §4.9.
Exercise 4.1 (computation)
Compute for ZOH on the 1-D test problem , , . Then verify by hand that .
Solution
. The discrete eigenvalue is , strictly inside the unit disk. The continuous decay rate is per unit time; the discrete decay factor per step is , so after 10 steps (one continuous-time unit) the state is reduced by a factor — matching the continuous decay exactly. ZOH is autonomous-exact.
Exercise 4.2 (computation)
Compute for the bilinear transform on the same 1-D problem (, ). Compare against — which is smaller, and why?
Solution
. ZOH gives . The bilinear value is slightly smaller, meaning bilinear overdamps slightly relative to the true exponential. This is the second-order Padé approximation underestimating : the truncation in removes positive higher-order terms. For small the gap is ; here the difference is about .
Exercise 4.3 (computation + code)
Run companions/ch04/jax/discretization_comparison.py and verify empirically that ZOH has slope on the log-log error-vs- plot, bilinear and exp-trapezoidal have slope . What happens to the ZOH slope if you set the forcing (autonomous case)?
Solution
The companion’s error_sweep function prints slopes near 1.00, 2.00, 2.00. Setting in the autonomous case makes ZOH exact — the error drops to roundoff ( in float64) and the slope is meaningless. This is consistent with the autonomous-exactness property of §4.3: ZOH’s only error source for forced systems is the piecewise-constant approximation of .
Exercise 4.4 (theory) — solution in §4.9
Prove the augmented matrix exponential trick used in §4.3: that
for invertible , by computing the series term-by-term.
Exercise 4.5 (theory) — solution in §4.9
Prove that the Möbius transformation sends the open left half-plane bijectively to the open unit disk , and sends the imaginary axis bijectively to the unit circle (minus the point ).
Exercise 4.6 (theory) — solution in §4.9
Derive the exponential-trapezoidal scheme of §4.5 from the variation-of-parameters formula by substituting the linear interpolation into the forcing integral and evaluating term-by-term using the -function identities.
4.9 Full solutions to theory exercises
Solution to Exercise 4.4
The matrix satisfies, for ,
This is verified by induction: matches by construction, and as required. Now sum the matrix-exponential series:
where , using invertibility of to extract the leading . Plugging back gives the claimed result. ∎
This is exactly the identity used to compute and jointly from one expm call in §4.3: take and .
Solution to Exercise 4.5
Let . Imaginary axis to unit circle: for with ,
so lies on the unit circle. The map traces the unit circle once as ranges over , missing only the limit point .
Left half-plane to unit disk: write with . Then
Since , (both for where the signs match the absolute values, and for where they reverse). The -terms are identical in numerator and denominator. Therefore the numerator is strictly less than the denominator, so .
Bijectivity: Möbius transformations with are bijections of the Riemann sphere to itself. Restriction to the half-plane gives a bijection to the disk. The inverse is , which can be verified by direct computation. ∎
This Möbius geometry is the algebraic content of A-stability for bilinear: a scheme is A-stable iff its eigenvalue map sends the open left half-plane into the closed unit disk. Bilinear achieves this bijectively; ZOH does so as well but non-bijectively (it folds an infinite strip of width onto the unit disk, aliasing high-frequency continuous modes onto low-frequency discrete ones).
Solution to Exercise 4.6
Start from variation of parameters on :
Substitute the linear interpolant for :
with and . Change variables in :
where in the last step we substituted the definition applied with argument . For , change variables again , so and :
The remaining -weighted integral evaluates via integration by parts to , using the identity. Carefully tracking the algebra gives as claimed. Combining,
The local truncation error is for inputs (the linear interpolation has error over an interval of length , and the integral picks up an additional factor). Global error is therefore — second-order, as advertised.
4.10 Companion code
Three runnable companions for Chapter 4:
JAX (companions/ch04/jax/):
discretization_comparison.py— implements ZOH, bilinear, and forward Euler in JAX; runs the error-vs-step-size sweep on a forced damped oscillator; emitseigenvalue_migration.pngandorder_convergence.png.exp_trapezoidal.py— implements the exp-trapezoidal scheme via the augmented matrix exponential; verifies second-order convergence against the same forced oscillator; emitsexp_trap_convergence.png.
Julia (companions/ch04/julia/):
discretization_atlas.jl— ports the post_transformers Week-9 reference implementation to ssm-foundations; usesDifferentialEquations.jlTsit5as the ground-truth reference solver and reports the empirical slope per scheme.Project.toml/Manifest.toml— companion-local Julia environment pinningDifferentialEquationsandLinearAlgebra.
To run from the repo root:
# JAX (uses post_transformers .venv with scipy, numpy, matplotlib, jax pre-installed)
PYTHONPATH=. python companions/ch04/jax/discretization_comparison.py
PYTHONPATH=. python companions/ch04/jax/exp_trapezoidal.py
# Julia (first run will precompile DifferentialEquations.jl; ~2–5 minutes)
julia --project=companions/ch04/julia companions/ch04/julia/discretization_atlas.jl
All figures emit to public/figures/ch04/.
Stability regions, Butcher tableau, and order conditions
Runge–Kutta theory, the Butcher tableau formalism, order conditions, stability regions in the complex plane, A-stability and L-stability, and the Dahlquist barrier — applied to the Chapter 4 discretization atlas.
Stability regions, Butcher tableau, and order conditions
5.1 Runge–Kutta methods and the Butcher tableau
Chapter 4 introduced ZOH, bilinear, and exp-trapezoidal as three specific one-step schemes for the linear state-space system. They are members of a much larger family: one-step methods that compute from using only one previous state and possibly intermediate stage evaluations of the right-hand side. The Runge–Kutta (RK) family is the most studied subfamily.
For the autonomous initial-value problem — we drop the explicit input dependence for this chapter; everything generalizes — an -stage Runge–Kutta method computes
The are the stage derivatives — evaluations of at intermediate states. The constants define the method, where are the stage times. The Butcher tableau is the standard tabular notation
The method is explicit if is strictly lower triangular ( for ), in which case each depends only on previously computed . It is implicit otherwise, requiring a nonlinear system solve at each step.
Three classical examples make the formalism concrete.
Forward Euler. The simplest explicit method, :
The midpoint rule (RK2). A two-stage explicit method:
Classical RK4. The most-quoted explicit method, four stages:
Fourth-order globally, the workhorse of physics simulations for half a century.
The tableau abstraction was introduced by John Butcher in the 1960s and is now the lingua franca of one-step methods: every reference paper on numerical integration presents its method as a Butcher tableau. The constraint (consistency of stage times) and (consistency of the weighted sum) are imposed by every tableau worth its name. Without them the method does not even reproduce the trivial constant trajectory when .
5.2 Order conditions
How do we know that classical RK4 is fourth-order globally? The convergence order of an RK method is determined by order conditions: polynomial identities on the entries of that must hold for the local truncation error to be .
The derivation expands as a Taylor series and matches term-by-term against the RK output produced from . For a scalar autonomous ODE the leading conditions are:
- Order 1:
- Order 2: also
- Order 3: also and
- Order 4: four more conditions involving products of
For systems (the case we actually care about), the number of order conditions grows combinatorially: 5 conditions for order 4, 17 for order 5, 37 for order 6 Hairer et al. (1993) . Butcher organized these conditions using a bijection between order-conditions and rooted trees (each tree corresponds to an integer partition of -evaluations and gives one polynomial identity). The combinatorial explosion is one reason 8th-order RK methods are rarely used: there are over 200 order conditions to satisfy simultaneously, and the resulting tableaux have dozens of stages.
(Butcher order conditions.) Suppose is sufficiently smooth. An -stage Runge–Kutta method with tableau is of order — meaning — if and only if a specific finite set of polynomial conditions on holds (one per Butcher tree of order ). The Lax equivalence theorem (Chapter 4) then guarantees global convergence at order .
The proof is non-trivial and we will not reproduce it; the canonical reference is Hairer–Nørsett–Wanner Chapter II.2 Hairer et al. (1993) . For our purposes the takeaway is that the order of an RK method is a property of its tableau entries, derivable by algebraic manipulation alone. You can verify (Exercise 5.1) that RK4’s tableau satisfies all 4 order-4 conditions; you can also verify (Exercise 5.2) that an arbitrary two-stage explicit tableau cannot achieve order higher than 2, regardless of how you choose the free parameters.
5.3 The stability function and stability regions
Order conditions characterize how a method behaves as on a smooth problem. Stability characterizes how it behaves at fixed on a stiff problem — when the time scales in the dynamics are well-separated and at least one is much faster than .
The standard probe is the Dahlquist test equation
The exact solution is . Applying an -stage RK method to this scalar problem produces, after one step,
where is the stability function of the method. For an explicit RK method, is a polynomial in of degree :
where is the all-ones -vector and the inverse expands as a finite geometric series because is nilpotent (strictly lower-triangular). For an implicit RK method, is a rational function .
Three examples make this concrete.
Forward Euler. With , : .
Midpoint RK2. Working through the formula gives — the second-order Taylor polynomial of .
Classical RK4. — the fourth-order Taylor polynomial.
A pattern emerges: for a -th order explicit RK method, agrees with through terms of degree . This is no coincidence — order requires the method to reproduce correctly on the test problem through that order in . The error is near .
The stability region of the method is the set
\stabregion = \set{z \in \C : \abs{\stabfn(z)} \le 1}.On the test problem with in the left half-plane (so the exact solution decays), the method’s iterates stay bounded iff . The stability region is the constraint on that keeps the method well-behaved on stiff problems — equivalently, the set of values you can take for a given .
The stability region of forward Euler is the disk — a disk of radius 1 centered at . For a real-eigenvalue , this requires . The tighter this constraint becomes (large = stiff problem), the smaller the maximum step size — this is the step-size restriction that motivates implicit methods.
5.4 A-stability, L-stability, and the Dahlquist barrier
A method is A-stable if its stability region contains the entire closed left half-plane: . On a stable linear problem with , an A-stable method preserves stability for every positive step size — no step-size restriction at all. This is the gold standard for stiff problems.
A method is L-stable if it is A-stable and as . L-stability damps fast eigenmodes aggressively, which is desirable when you want the discrete method to not preserve fast oscillations the continuous solution decays. Backward Euler is L-stable; the trapezoidal rule (= bilinear from Chapter 4) is A-stable but not L-stable (its stability function tends to , not 0, as ). The distinction matters when the system has eigenvalues with very large negative real part: bilinear will preserve high-frequency oscillation modes that should physically be damped.
The fundamental theorem on A-stability of explicit methods is the Dahlquist barrier:
(Dahlquist barrier for explicit RK.) No explicit Runge–Kutta method is A-stable. The maximum order of an A-stable Runge–Kutta method (explicit or implicit) is unbounded, but explicit methods cannot achieve A-stability at any order.
The proof for the explicit case is direct: an explicit RK method has polynomial in , and any polynomial diverges as . So is bounded, and cannot contain the entire (unbounded) left half-plane. ∎
The implicit case is more subtle: there exists an A-stable Runge–Kutta method of every order, but the constructions are increasingly elaborate. The Gauss–Legendre family — Chapter 6’s main attraction — produces A-stable implicit methods of every even order. The -stage Gauss–Legendre method is order , optimally accurate among -stage implicit RK methods, and A-stable, and (a bonus for Hamiltonian systems) symplectic.
The practical consequence for SSM discretization: you cannot have an explicit, A-stable, high-order one-step method. You must pick at most two of the three properties (explicit, A-stable, order ≥ 3). The Chapter 4 schemes resolve this trilemma differently:
- Forward Euler: explicit, order 1, not A-stable.
- ZOH and exp-trapezoidal: A-stable and of arbitrary order on autonomous problems, by virtue of using the matrix exponential — but they sit outside the explicit-RK family (they have non-polynomial stability functions).
- Bilinear: A-stable, order 2, implicit in form (requires inverting ).
The exponential integrators of §4.5 are a way to evade the Dahlquist barrier: by computing exactly, the stability function inherits the full LHP-stability of the exponential, and the polynomial-order restriction is moved from to a different object (the approximation of the forcing by -functions). This is why exp-trapezoidal can be both “explicit in the input” (no nonlinear solve in ) and A-stable.
5.5 Stability regions of the Chapter 4 schemes
Reading the three Chapter 4 schemes through the lens of stability functions:
ZOH. From §4.3, , so for the scalar test problem . The stability region is — exactly the closed left half-plane. ZOH is A-stable, in fact also L-stable because as .
Bilinear (Tustin). From §4.4, . The Möbius geometry from Chapter 4 §4.4 / Exercise 4.5 says the LHP maps bijectively to the open unit disk, so the stability region is exactly the closed left half-plane — bilinear is A-stable. But as , so bilinear is not L-stable. This is the trapezoidal-rule failure mode: oscillatory artifacts on very stiff problems.
Exp-trapezoidal. The homogeneous stability function is identical to ZOH’s: on the test problem (the difference between ZOH and exp-trap shows up only in the forcing integrals; the homogeneous behavior is the same). So exp-trapezoidal is A-stable and L-stable, just like ZOH.
5.6 What’s next
This chapter has laid out the systematic framework for analyzing one-step methods: Butcher tableau, order conditions, stability functions, stability regions, A-stability and L-stability, and the Dahlquist barrier. Chapter 6 uses this framework to motivate the implicit family — backward Euler, BDF, and the Gauss–Legendre IRK schemes that achieve A-stability at every order — and the symplectic family, which trades order or A-stability for the preservation of geometric invariants. Chapter 6 is where the C1 niche pilot lives: symplectic integration of selective SSMs viewed as Hamiltonian flows on the hidden-state manifold.
A natural side path is embedded methods — pairs of RK tableaux of consecutive orders sharing stages, used for adaptive step-size control. The DP54 method (Dormand–Prince 5(4)) used by DifferentialEquations.jl’s default solver Tsit5 is a famous example. Embedded methods are excellent practical tools but tangential to the SSM pedagogy of this book; the companion discretization_atlas.jl references the RKF45 / DP54 tableaux for completeness.
5.7 Exercises
Six problems. Short/numerical (5.1–5.3) have inline collapsible solutions; long/proof exercises (5.4–5.6) have full worked solutions in §5.8.
Exercise 5.1 (computation)
Verify that classical RK4 satisfies the four order-conditions of order 4 — at least the simpler ones: , , . Use the tableau in §5.1.
Solution
From the tableau, and .
- . ✓
- . ✓
- . ✓
Exercise 5.2 (computation)
Show that an arbitrary two-stage explicit RK tableau (one free parameter , with , to be determined, ) cannot achieve order 3 no matter how are chosen.
Solution
For order 2 we need (trivially ) and , giving , . So the two-parameter family collapses to a one-parameter family of order-2 methods (midpoint , Heun , etc.).
For order 3 we additionally need and . The second condition uses , giving . Contradiction. So no two-stage explicit method achieves order 3.
This is a small instance of the general fact: order requires at least stages for explicit methods (Butcher’s order barriers).
Exercise 5.3 (computation + code)
Run companions/ch05/jax/order_verification.py. Confirm the empirical slopes match Forward Euler (1), midpoint RK2 (2), classical RK4 (4). What happens to the RK4 slope at very small (say )?
Solution
The companion’s order_verification function prints slopes , , for the three methods in turn. At very small , the RK4 error reaches the roundoff floor — typically in float64 — and the slope flattens out as machine precision starts to dominate the method error. This is why empirical order is always estimated from the two finest dt where the method error still exceeds roundoff, not from arbitrarily fine dt.
Exercise 5.4 (theory) — solution in §5.8
Prove that for an explicit Runge–Kutta method with tableau , the stability function is
where . Verify that for classical RK4 this evaluates to .
Exercise 5.5 (theory) — solution in §5.8
Prove the Dahlquist-barrier statement for explicit RK methods: that no explicit RK method is A-stable. (The proof was sketched in §5.4; flesh it out, using the fact that explicit methods have polynomial stability functions.)
Exercise 5.6 (theory) — solution in §5.8
For the bilinear/Tustin scheme, show that the stability function satisfies as along any direction with bounded, and explain why this rules out L-stability. Discuss what this means for the discrete dynamics of a bilinear-discretized stable LTI system with very fast-decaying eigenmodes.
5.8 Full solutions to theory exercises
Solution to Exercise 5.4
Apply the -stage explicit RK method to the test ODE , . The stage equations become
Set and let . The vector form is
For an explicit method is strictly lower-triangular, hence nilpotent (), and the geometric series truncates. So , and
For classical RK4, direct computation of — expanding the geometric series and multiplying out — gives, after some algebra,
This is the fourth-order Taylor polynomial of , as expected for a fourth-order method.
Solution to Exercise 5.5
For an explicit RK method, is strictly lower-triangular, so and the geometric series for truncates at . Therefore
is a polynomial in of degree at most . Polynomials are unbounded: as along any ray (assuming the leading coefficient is nonzero, which holds for any -stage method of order , the only interesting case).
The closed left half-plane is unbounded — it extends to along the real axis. The stability region \stabregion = \set{z : \abs{\stabfn(z)} \le 1} is therefore bounded. So cannot contain the entire LHP, and the method is not A-stable. ∎
The proof relies essentially on being a polynomial. Implicit methods have rational , which can be bounded at infinity if . This is the precise sense in which implicit methods evade the Dahlquist barrier.
Solution to Exercise 5.6
Write with and bounded. Then
For large the dominant terms are . More precisely,
This rules out L-stability: L-stability requires , not . ∎
Practical implication. For a stable LTI system with having an eigenvalue with very large — a fast-decaying mode — bilinear discretization at step size produces a discrete eigenvalue . The corresponding mode in the discrete dynamics therefore alternates sign at every step before decaying, rather than smoothly decaying. This trapezoidal-rule oscillation is the classic L-stability failure mode: the discrete solution exhibits spurious oscillations not present in the continuous solution. For SSMs this manifests as oscillatory artifacts in the hidden state when the system has eigenvalues with very negative real parts — exactly the regime where stiff implicit methods like BDF (Chapter 6) are preferred over bilinear.
5.9 Companion code
Two JAX companions and one Julia companion for Chapter 5.
JAX (companions/ch05/jax/):
stability_regions.py— computes and plots stability regions for forward Euler, midpoint RK2, classical RK4, ZOH, bilinear, and exp-trapezoidal in the complex plane; emitsrk_stability_regions.png,atlas_stability.png, andbutcher_tableaux.png.order_verification.py— implements RK1/RK2/RK4 from their tableaux; verifies empirical orders 1, 2, 4 on the forced damped oscillator from Chapter 4; emitsorder_verification.png.
Julia (companions/ch05/julia/):
butcher_tableau_zoo.jl— usesDifferentialEquations.jlsolver tables (OrdinaryDiffEqTsit5, etc.) to extract Butcher coefficients programmatically and print them side-by-side with the analytic forms used in §5.1.Project.toml/Manifest.toml— companion-local Julia environment.
To run from the repo root:
# JAX
PYTHONPATH=. python companions/ch05/jax/stability_regions.py
PYTHONPATH=. python companions/ch05/jax/order_verification.py
# Julia (first run will precompile DifferentialEquations.jl)
julia --project=companions/ch05/julia companions/ch05/julia/butcher_tableau_zoo.jl
All figures emit to public/figures/ch05/.
Implicit methods, stiff systems, symplectic integration
When explicit methods fail — backward Euler, BDF, DIRK as A-stable alternatives; exponential integrators as the family of choice for selective SSMs; Hamiltonian mechanics and symplectic integration as the natural geometric language for hidden-state dynamics.
Implicit methods, stiff systems, symplectic integration
6.1 Why explicit methods fail: stiffness
Chapter 5 ended with the Dahlquist barrier: no explicit Runge–Kutta method is A-stable. The barrier matters because of a phenomenon called stiffness — when the dynamics matrix has eigenvalues with very different magnitudes.
A linear ODE is stiff if has at least one eigenvalue with for a time-scale much smaller than the simulation horizon, and the eigenvalue is paired with a long-lived (small-) mode that one actually wants to track. Stiffness is therefore relative: a system is stiff with respect to a given horizon and given accuracy goal, not in absolute terms. The Robertson chemical-kinetics problem (‘s spanning ten orders of magnitude) is the canonical textbook example; van der Pol at parameter is a popular ML-adjacent stand-in because it generates a nonlinear oscillation whose slow and fast phases differ by .
For an explicit method, the step size is bounded by the fastest eigenvalue: (forward Euler) or some other small constant times (RK4). If the slow eigenvalue we want to track is , the integrator takes times more steps than the slow time scale would suggest. For Robertson this ratio is . The simulation never finishes.
For an SSM, stiffness arises naturally for input-dependent dynamics. Mamba’s selective scan generates an effective at each token; for inputs that produce eigenvalues with very different magnitudes, the time-discretized recurrence is solving a stiff problem at each step. If the discretization is not A-stable, the recurrence either blows up or requires a step size — at which point you are running the model in an absurd regime. This is the structural reason Mamba-3 switched to exponential-trapezoidal (§4.5), which is A-stable in but explicit in .
6.2 Implicit methods: backward Euler, BDF, DIRK
The fix for stiffness is to abandon explicit methods. Backward Euler is the simplest implicit method:
Solving for requires either a linear solve (for linear ) or a Newton iteration (for nonlinear ). The pay-off is the stability function
which has for every with — backward Euler is A-stable. It is also L-stable: as , so fast modes get damped aggressively as desired on stiff problems.
(Backward Euler is L-stable.) Backward Euler’s stability function satisfies for every with , and as .
The proof is one line each: when , giving . The limit is immediate.
Backward Euler is first-order accurate. Higher-order implicit methods come in several families.
BDF (Backward Differentiation Formula, th-order) is a multi-step method that approximates by a -point backward difference and equates it to . BDF1 = backward Euler; BDF2 is second-order and L-stable; BDF for are conditionally stable but widely used in stiff ODE codes. BDF7+ is unstable. The order limit is the second Dahlquist barrier, a structural fact about multi-step methods that we will state without proof: no -stable linear multi-step method has order higher than 2.
Diagonally implicit RK (DIRK) sits between explicit RK and fully implicit RK. The matrix is lower triangular with non-zero diagonal, so each stage requires solving an independent nonlinear system (cheaper than the coupled system of fully implicit RK, but more expensive than the trivial explicit evaluations). Singly-DIRK (SDIRK) further constrains all diagonal entries to be equal, so the Newton Jacobian factorizes once per step. The Crank–Nicolson scheme is a 2-stage SDIRK of order 2.
Fully implicit RK — the most expensive family — includes Gauss–Legendre methods, which we will return to in §6.5 as the optimal symplectic methods.
The computational trade-off for implicit methods: each step costs (linear solve for the Jacobian factor) or (Newton iteration for nonlinear ). For SSMs with this is acceptable; for in the thousands it begins to dominate. The Chapter 4 exponential-trapezoidal scheme has the same A-stability as backward Euler but only costs one matrix-exponential per timestep (vs one Newton solve), and the matrix exponential of a low-rank or diagonalized is cheap. Mamba-3’s specific tactical advantage over a backward-Euler implementation comes down to this cost asymmetry.
6.3 Exponential integrators revisited
The exp-trapezoidal scheme of §4.5 is the simplest member of the exponential integrator family. Recall the general idea: given , treat the linear part exactly via and approximate only the forcing integral. The -function family provides building blocks for arbitrarily high order.
Three commonly used schemes:
- exp-Euler: . Treats the input as piecewise constant (like ZOH); first-order for forced systems, exact on autonomous problems.
- exp-midpoint: uses the input midpoint instead of ; second-order for symmetric input perturbations.
- exp-trapezoidal (§4.5): linear interpolation of the input; second-order for inputs.
- ETDRK4 Hochbruck & Ostermann (2010) : a four-stage exponential RK method that achieves fourth order. The Mamba-3 paper does not (yet) use ETDRK4; its second-order exp-trapezoidal is the empirical sweet spot.
All exponential integrators inherit A-stability from . They sidestep the Dahlquist barrier because their stability function is not a polynomial — it is times a rational-in- correction factor — and the polynomial constraint of explicit RK methods does not apply.
The implementation cost: one matrix exponential plus one or two -function evaluations per step. For a low-rank (which selective SSMs typically have, since the structured- pattern of Chapter 7 carries forward), the matrix exponential is cheap. For dense general exponentials cost via Padé / scaling-and-squaring, the same as a Newton solve — so the choice between exp-trap and backward Euler often comes down to whether you want autonomous-exactness (exp-trap wins) or L-stability damping of high-frequency modes (backward Euler wins).
6.4 Hamiltonian mechanics, energy conservation, and the symplectic structure
The remainder of this chapter introduces the language of Hamiltonian systems and symplectic integration. This is where the direct-transfer hook from continuum mechanics applies most cleanly. The exposition is brief and standalone; for the full theory see Hairer–Lubich–Wanner Hairer & Wanner (1996) Volume II Chapter VI.
A Hamiltonian system on phase space is governed by a scalar function — the Hamiltonian, intended to represent total energy — through Hamilton’s equations:
The canonical example is the harmonic oscillator , giving , . The pendulum is ; vortex-dynamics flows are higher-dimensional but structurally identical.
Three properties make Hamiltonian systems geometrically special.
Energy conservation. Along any trajectory , . The Hamiltonian is a constant of motion.
Phase-space volume preservation (Liouville’s theorem). The flow has Jacobian determinant 1 everywhere: . Volumes in phase space are preserved exactly under the dynamics.
Symplectic structure. Define the symplectic 2-form . The flow is a symplectic transformation: it preserves in the sense that where denotes the pullback. Symplecticity is a strict refinement of volume preservation — every symplectic transformation is volume-preserving, but not vice versa.
The reason these properties matter for numerics is that standard order-optimized integrators destroy them. A fourth-order RK method simulates the harmonic oscillator with local error per step — but it does not exactly preserve energy. Over steps, the energy error accumulates as per period times the number of periods — leading to linear-in-time energy drift that eventually destroys the trajectory’s qualitative behavior. RK4 on a pendulum will, over 1000 periods, gradually inflate the orbit and unbound the system. RK4 on a planetary orbit will (eventually) eject the planet. These are not numerical instabilities in the Chapter 5 sense — the method is A-stable for the small-amplitude regime — they are geometric errors: the wrong qualitative dynamics.
6.5 Symplectic integrators
A symplectic integrator is a numerical map that preserves the symplectic 2-form: . Symplectic integrators do not, in general, preserve exactly — but they preserve a nearby modified Hamiltonian for a -th order symplectic method. As a consequence, the energy along the discrete trajectory oscillates within a bounded band of width around its initial value — not drifting linearly with , but staying near a level set of . This is the backward error analysis of geometric integrators, and it is the practical reason symplectic methods are mandatory for long-horizon Hamiltonian simulations.
Three canonical schemes.
Symplectic Euler (1st order). For a separable Hamiltonian :
Update momentum first using position-derived force; then update position using the updated momentum. The asymmetry — using in the second step rather than — is what makes the scheme symplectic. A symmetric version updates first, then ; both are valid.
Störmer–Verlet (2nd order, symmetric). A symmetric version of symplectic Euler that does a half-step in momentum, full step in position, half-step in momentum:
This is the algorithm used in essentially every molecular-dynamics simulation. Second-order accurate, symplectic, time-reversible (the same algorithm run with exactly reverses the trajectory). The energy error is bounded by over arbitrarily long horizons.
Gauss–Legendre IRK (order , A-stable, symplectic). The -stage Gauss–Legendre Runge–Kutta method has nodes at the Gauss–Legendre quadrature points and is the unique -stage RK method of order . The 1-stage method is the implicit midpoint rule (order 2, symplectic, A-stable). The 2-stage method is order 4. Gauss–Legendre IRK is the unique family that is simultaneously A-stable, symplectic, and of optimal order — making it the gold standard for high-accuracy long-horizon Hamiltonian simulation.
(Modified Hamiltonian for symplectic methods.) For a symplectic integrator of order applied to a Hamiltonian system, there exists a modified Hamiltonian such that the discrete trajectory exactly preserves to within exponentially small terms over exponentially long times. The original therefore oscillates within a bounded band along the discrete trajectory.
The proof — Hairer–Lubich–Wanner Chapter IX — uses backward error analysis to expand the modified Hamiltonian as an asymptotic series in and shows the series can be truncated with controlled remainder. The takeaway for practice: a -th order symplectic integrator is exponentially better than a -th order non-symplectic one on Hamiltonian problems, even though the local truncation error of both is . Local error is misleading for long-horizon problems; geometric structure is what matters.
6.6 Toward selective SSMs as Hamiltonian flows
The C1 niche pilot proposes the following research program. The hidden-state dynamics of a selective SSM can sometimes be written as a Hamiltonian flow on the hidden-state manifold; when they can, applying a symplectic integrator should reduce long-horizon error in a way that order-optimized integrators cannot match. The first step is to characterize when this Hamiltonian structure exists.
For a linear-time-invariant SSM , the dynamics are Hamiltonian iff there exists a symmetric positive-definite such that . Equivalently, is similar (via ) to a skew-symmetric matrix. The quadratic Hamiltonian is then , and the symplectic 2-form is (in the standard symplectic-flat coordinates). For HiPPO-LegS the eigenvalues are all real-negative, not purely imaginary, so HiPPO-LegS dynamics are dissipative, not Hamiltonian. Symplectic methods do not apply. This is consistent with the empirical observation that ZOH and bilinear discretizations of HiPPO-LegS perform well — dissipation handles the long-horizon stability story.
For a selective SSM with input-dependent , the relevant question becomes: does have a Hamiltonian decomposition for some range of inputs, even if not for all? The C1 pilot’s empirical hypothesis is that during certain training phases the learned matrices drift toward eigenvalue spectra that are closer to purely imaginary than to the LHP, and that this drift is associated with the long-horizon recall artifacts observed in Mamba’s failure modes Anonymous (2025) . If this hypothesis holds, then symplectic discretization of selective SSMs should reduce those artifacts. The Chapter 17 pilot integration develops this thread; this chapter has supplied the necessary integration-theory vocabulary.
6.7 What’s next
This chapter completes the foundations layer of the book (Part I). The next part — Chapters 7–10 — introduces structured SSMs (S4, S4D, S5, Mamba-1/2, Mamba-3) as discretizations of the continuous systems developed here. Chapter 7 sets up HiPPO theory; Chapter 8 derives the S4 / S4D / S5 architectures and shows where bilinear discretization shines; Chapter 9 introduces selective scans (Mamba-1/2); Chapter 10 brings the exp-trapezoidal scheme of §4.5 to its home in Mamba-3.
The C1 pilot integration in Chapter 17 returns to this chapter’s symplectic story, applying the modified-Hamiltonian framework to learned selective dynamics. Readers focused on the SSM applications can skip to Chapter 7 now and consult §6.5–§6.6 when the symplectic discussion comes back.
6.8 Exercises
Six problems. Short/numerical (6.1–6.3) have inline collapsible solutions; long/proof exercises (6.4–6.6) have full worked solutions in §6.9.
Exercise 6.1 (computation)
Verify that backward Euler’s stability function is by directly applying the scheme to the test problem and solving for .
Solution
Backward Euler: . Rearranging, , so . Setting , the stability function is . ∎
Exercise 6.2 (computation)
For the harmonic oscillator , (so ), compute one step of symplectic Euler (“p first” variant) starting from at . What are , and what is ?
Solution
Symplectic Euler ( first): , then .
Starting from : . Then . So .
Energy: . Initial energy was . The energy is not exactly preserved (it changed by ), but it is bounded — over many steps it will oscillate, not drift.
Exercise 6.3 (computation + code)
Run companions/ch06/jax/symplectic_demo.py. Verify that classical RK4 produces linear-in-time energy drift on the harmonic oscillator while Störmer–Verlet’s energy stays in a bounded band. What is the drift rate (energy per period) of RK4 at ?
Solution
The companion’s main output shows two energy-vs-time traces over 100 periods. RK4 drifts approximately linearly; the drift rate is roughly per period at — consistent with local error compounded over steps per period. Störmer–Verlet’s energy stays within a band of width , with no secular drift. The contrast — linear drift vs bounded oscillation — is the key takeaway of §6.5.
Exercise 6.4 (theory) — solution in §6.9
Prove the L-stability of backward Euler (Theorem 6.1): for , and as .
Exercise 6.5 (theory) — solution in §6.9
Show that symplectic Euler is symplectic. That is, prove that the Jacobian has determinant exactly 1 for the harmonic oscillator. (Then note: for separable Hamiltonians, the same argument applies via the chain rule.)
Exercise 6.6 (theory) — solution in §6.9
Show that for a linear-time-invariant ODE with skew-symmetric (), the function is a constant of motion. Use this to give a sufficient condition (in terms of only) for an SSM to be Hamiltonian.
6.9 Full solutions to theory exercises
Solution to Exercise 6.4
Write with . Then . Since , , so . Therefore . ∎
For the L-stability limit: as with bounded, , so . ∎
Solution to Exercise 6.5
For the harmonic oscillator , , symplectic Euler ( first) gives
The Jacobian is
The determinant: . ∎
So the discrete map preserves the symplectic 2-form exactly. This is the algebraic content of “symplectic Euler is symplectic” — the determinant being exactly , not .
For a general separable Hamiltonian , symplectic Euler gives and . The Jacobian factorizes as
each of which has determinant 1 (lower- and upper-triangular with 1s on the diagonal). The product determinant is therefore 1. ∎
This “shear-decomposition” argument generalizes to all symplectic integrators that update and in alternating steps: each individual sub-step is a shear, with unit determinant. Verlet, leapfrog, Yoshida composition methods — all are products of unit-determinant shears.
Solution to Exercise 6.6
For with , compute
since by skew-symmetry. So — the squared norm is exactly preserved. ∎
Sufficient condition for an SSM to be Hamiltonian. Any SSM whose continuous dynamics matrix is skew-symmetric is Hamiltonian with Hamiltonian . The eigenvalues of a real skew-symmetric matrix are purely imaginary (or zero), so this regime is exactly the purely oscillatory corner of the SSM landscape — the opposite of the LHP-stable corner where HiPPO and S4 live. A more general sufficient condition: has a Hamiltonian structure (in the matrix-Lie-algebra sense) — equivalently, there exists a symmetric positive-definite with . The C1 pilot’s empirical project is to identify selective-SSM regimes where this condition holds approximately, and to apply symplectic discretization there.
6.10 Companion code
Two JAX companions and two Julia companions for Chapter 6.
JAX (companions/ch06/jax/):
stiff_demo.py— van der Pol oscillator at (mildly stiff); compares explicit RK4 (blows up at large ) against backward Euler (stable at every ); emitsstiff_blowup.png.symplectic_demo.py— harmonic oscillator + pendulum; runs classical RK4 vs Störmer–Verlet over 100 periods; emitsenergy_drift.pngandphase_portrait.png.
Julia (companions/ch06/julia/):
implicit_methods.jl— backward Euler, BDF2, and DIRK from-scratch implementations on a stiff test problem; uses onlyLinearAlgebra+Printf.symplectic_methods.jl— symplectic Euler, Störmer–Verlet, and the 2-stage Gauss–Legendre IRK method on a Hamiltonian test; emits empirical order-of-accuracy table.Project.toml/Manifest.toml.
To run from the repo root:
# JAX
PYTHONPATH=. python companions/ch06/jax/stiff_demo.py
PYTHONPATH=. python companions/ch06/jax/symplectic_demo.py
# Julia (lightweight — no DifferentialEquations.jl dependency)
julia --project=companions/ch06/julia companions/ch06/julia/implicit_methods.jl
julia --project=companions/ch06/julia companions/ch06/julia/symplectic_methods.jl
All figures emit to public/figures/ch06/.
HiPPO theory: orthogonal-basis projection operators
Legendre/Laguerre/Fourier bases; optimal polynomial approximation of input history.
HiPPO theory: orthogonal-basis projection operators
This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.
LTI SSMs: S4, S4D, S5 as discretized continuous systems
Three discrete realizations of the same continuous SSM; convolutional vs recurrent views.
LTI SSMs: S4, S4D, S5 as discretized continuous systems
This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.
Selective SSMs: Mamba-1, Mamba-2, SSD framework
Input-dependent parameters; selective scan; state-space duality with attention.
Selective SSMs: Mamba-1, Mamba-2, SSD framework
This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.
Mamba-3 and the exponential-trapezoidal integrator
Complex-valued state, second-order discretization, MIMO formulation — ICLR 2026 Oral; C1 empirical anchor.
Mamba-3 and the exponential-trapezoidal integrator
This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.
Linear attention and Hyena: long convolution and gated recurrence
Katharopoulos lineage; Hyena operator; gated linear attention (GLA/RetNet/HGRN2).
Linear attention and Hyena: long convolution and gated recurrence
This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.
Delta-rule lineage: DeltaNet, Gated DeltaNet, Kimi Linear
Online-learning lens on SSMs; Longhorn (implicit) vs DeltaNet (explicit); chunkwise parallelization.
Delta-rule lineage: DeltaNet, Gated DeltaNet, Kimi Linear
This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.
Exponential gates and matrix memory: xLSTM and RWKV-7
Exponential gating vs sigmoid; matrix-valued memory; vector-valued gating; in-context learning rates.
Exponential gates and matrix memory: xLSTM and RWKV-7
This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.
Hybrid architectures and gating mechanisms
MAD methodology; production hybrids (Jamba/Bamba/Nemotron-H/SambaY); gating design space — B pilot home chapter.
Hybrid architectures and gating mechanisms
This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.
Counter-evidence and diagnostic tools: where SSMs fail
TC^0 bounds; copying limitations; Gather-and-Aggregate mechanism; mechanistic evaluation.
Counter-evidence and diagnostic tools: where SSMs fail
This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.
Empirical methodology: benchmark protocols and evaluation
Synthetic probes; regime-aware evaluation; the 5-axis decomposition B uses.
Empirical methodology: benchmark protocols and evaluation
This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.
Niche-pilot integration: from curriculum to research output
How the 17-chapter foundations map to the C1 + B pilots and the broader 13-niche portfolio.
Niche-pilot integration: from curriculum to research output
This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.