Part I · Foundations Week 3 Published

Linear algebra for sequence models: structured matrices and conditioning

Eigenvalue and SVD decompositions, condition number, the four structured-matrix families (Toeplitz, Vandermonde, Cauchy, semiseparable), low-rank updates, and a Krylov-subspace primer — the linear-algebra vocabulary later chapters assume.

On this page
  1. 3.1 Eigenvalue decomposition and Jordan normal form
  2. 3.2 Singular value decomposition
  3. 3.3 Condition number
  4. 3.4 Structured matrix families
  5. Toeplitz matrices
  6. Vandermonde matrices
  7. Cauchy matrices
  8. Semiseparable matrices
  9. 3.5 Low-rank corrections and rank-1 updates
  10. 3.6 Krylov subspaces: a primer
  11. 3.7 What’s next
  12. 3.8 Exercises
  13. Exercise 3.1 (computation)
  14. Exercise 3.2 (computation)
  15. Exercise 3.3 (computation)
  16. Exercise 3.4 (computation)
  17. Exercise 3.5 (theory) — solution in §3.9
  18. Exercise 3.6 (theory) — solution in §3.9
  19. 3.9 Full solutions to theory exercises
  20. Solution to Exercise 3.5
  21. Solution to Exercise 3.6
  22. 3.10 Companion code

Linear algebra for sequence models: structured matrices and conditioning

3.1 Eigenvalue decomposition and Jordan normal form

Chapter 1’s matrix exponential built directly on the spectral structure of A\statemat. The fundamental theorem is:

.

For every matrix ACN×N\statemat \in \C^{N \times N}, there exist matrices VV (invertible) and JJ (block-diagonal with Jordan blocks on the diagonal) such that

A=VJV1.\statemat = V J V^{-1}.

JJ is the Jordan normal form of A\statemat; the diagonal entries of JJ are the eigenvalues of A\statemat (counted with algebraic multiplicity); the size of each Jordan block equals the difference between the algebraic and geometric multiplicities. If every eigenvalue has equal algebraic and geometric multiplicity, all Jordan blocks have size 1 and JJ is diagonal — A\statemat is diagonalizable.

A Jordan block of size kk for eigenvalue λ\lambda looks like

Jk(λ)=(λ1λ1λ1λ)Ck×k,J_k(\lambda) = \begin{pmatrix} \lambda & 1 & & \\ & \lambda & 1 & \\ & & \ddots & \ddots \\ & & & \lambda & 1 \\ & & & & \lambda \end{pmatrix} \in \C^{k \times k},

i.e. eigenvalue on the diagonal, ones on the superdiagonal, zeros elsewhere. The matrix exponential of a Jordan block is

eJk(λ)t=eλt(1tt2/2!tk1/(k1)!1ttk2/(k2)!1t1),e^{J_k(\lambda) t} = e^{\lambda t} \begin{pmatrix} 1 & t & t^2/2! & \cdots & t^{k-1}/(k-1)! \\ & 1 & t & \cdots & t^{k-2}/(k-2)! \\ & & \ddots & \ddots & \vdots \\ & & & 1 & t \\ & & & & 1 \end{pmatrix},

which is the source of the polynomial-in-tt factors mentioned in Chapter 1, §1.2 — they appear exactly when there are Jordan blocks of size >1> 1.

For SSM analysis, the diagonalizable case dominates. Almost every A\statemat arising from random initialization or from training is diagonalizable; the non-diagonalizable case is a codimension-one subset of parameter space and shows up only at carefully-tuned boundaries. In practice the assumption ”A\statemat is diagonalizable” is essentially always valid; the Jordan form is the worst-case fallback.

3.2 Singular value decomposition

The eigenvalue decomposition requires A\statemat to be square; for general (rectangular or possibly non-diagonalizable) matrices, the right tool is the singular value decomposition.

.

For every matrix ARm×n\statemat \in \R^{m \times n}, there exist orthogonal matrices URm×mU \in \R^{m \times m}, VRn×nV \in \R^{n \times n}, and a diagonal-with-zeros matrix ΣRm×n\Sigma \in \R^{m \times n} with non-negative entries σ1σ2σmin(m,n)0\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)} \ge 0 on the diagonal, such that

A=UΣV.\statemat = U \Sigma V^\top.

The σi\sigma_i are the singular values of A\statemat; they are uniquely determined. The number of nonzero σi\sigma_i equals rank(A)\rank(\statemat).

The SVD has three properties that make it the workhorse of numerical linear algebra:

  1. It always exists. Unlike the eigenvalue decomposition, no diagonalizability or even squareness assumption is needed.
  2. The singular values give the Frobenius and operator norms. A2=σ1\norm{\statemat}_2 = \sigma_1 (operator norm, equal to the largest singular value); AF=iσi2\norm{\statemat}_F = \sqrt{\sum_i \sigma_i^2}.
  3. It reveals low-rank structure cleanly. Truncating the SVD at rank r<rank(A)r < \rank(\statemat) — keeping only the top rr singular values and corresponding columns of U,VU, V — gives the best rank-rr approximation to A\statemat in both Frobenius and operator norms (the Eckart–Young theorem).

For square A\statemat, the singular values are related to but distinct from the eigenvalues. The relationship σi(A)2=λi(AA)\sigma_i(\statemat)^2 = \lambda_i(\statemat^\top \statemat) (using descending order on both sides) lets you compute singular values via an eigenvalue problem on the Gram matrix AA\statemat^\top \statemat — though in practice the QR-iteration-based SVD algorithm (LAPACK’s gesdd) is more numerically stable.

The SVD shows up explicitly in:

  • Chapter 2’s Lyapunov-exponent computation (singular values of the propagated frame matrix).
  • Chapter 8’s HiPPO matrix analysis (the conditioning of the projection operator is given by its SVD).
  • Chapter 12’s delta-rule lineage (DeltaNet’s state update is a rank-1 SVD correction).

For a textbook treatment of the SVD’s properties and algorithms, Trefethen–Bau Trefethen & Bau (1997) Chapters 4–5 are the standard reference. Golub–Van Loan Golub & Van Loan (2013) covers the algorithmic details exhaustively.

3.3 Condition number

The condition number of a matrix ARN×N\statemat \in \R^{N \times N} (with respect to the operator norm) is

κ(A):=A2A12=σ1(A)σN(A),\kappa(\statemat) := \norm{\statemat}_2 \cdot \norm{\statemat^{-1}}_2 = \frac{\sigma_1(\statemat)}{\sigma_N(\statemat)},

the ratio of the largest to smallest singular value (when A\statemat is invertible; otherwise κ=\kappa = \infty). The condition number measures how much A\statemat amplifies relative input perturbations into relative output perturbations: if Ah=b\statemat \statevec = b and bb has relative error δb/b\delta b / \norm{b}, then the relative error in the computed solution can be as large as κ(A)δb/b\kappa(\statemat) \cdot \delta b / \norm{b}.

The qualitative scale: κ=1\kappa = 1 is the orthogonal case (no amplification); κ10k\kappa \approx 10^k means losing roughly kk decimal digits of precision when solving a linear system. Double-precision floating point has about 16 decimal digits; matrices with κ1016\kappa \gtrsim 10^{16} are numerically singular.

For SSMs, condition number matters in three places:

  1. HiPPO matrix construction. The HiPPO-LegS matrix has structure that keeps its condition number bounded as NN grows — this is why HiPPO is the standard initialization. Random initializations typically give κ\kappa growing as NO(1)N^{O(1)} or worse.
  2. S4 kernel computation. S4’s Vandermonde-Cauchy kernel construction requires inverting a matrix with structured but potentially ill-conditioned columns. The paper carefully handles the conditioning; naive implementations don’t.
  3. Mamba-3’s complex-state design. Chapter 10 discusses how Mamba-3 deliberately places eigenvalues in a region of the complex plane where the discrete-time map remains well-conditioned across the integration step.
Growth of the condition number of various structured matrices as size N increases.
Condition number κ(A) as a function of matrix size N for: a random Gaussian matrix (κ grows roughly linearly in N, modulo log factors); a Hilbert matrix (κ grows exponentially — the textbook example of ill-conditioning); the HiPPO-LegS matrix (κ stays bounded as N grows — its design feature). Produced by companions/ch03/jax/condition_number.py.

3.4 Structured matrix families

Four classes of structured matrices appear throughout the SSM literature. Each is parameterized by O(N)O(N) numbers rather than the O(N2)O(N^2) a general matrix needs, and each admits fast matrix-vector products via specialized algorithms.

Toeplitz matrices

A Toeplitz matrix is constant along each diagonal:

T=(t0t1t2t(n1)t1t0t1t2t1t0t2t1tn1t2t1t0).T = \begin{pmatrix} t_0 & t_{-1} & t_{-2} & \cdots & t_{-(n-1)} \\ t_1 & t_0 & t_{-1} & & \vdots \\ t_2 & t_1 & t_0 & \ddots & t_{-2} \\ \vdots & & \ddots & \ddots & t_{-1} \\ t_{n-1} & \cdots & t_2 & t_1 & t_0 \end{pmatrix}.

Parameterized by 2n12n - 1 values (t(n1),,tn1)(t_{-(n-1)}, \ldots, t_{n-1}). Toeplitz matrices are the matrix form of convolutions: if TT is Toeplitz with first column (t0,t1,,tn1)(t_0, t_1, \ldots, t_{n-1}) and first row (t0,0,,0)(t_0, 0, \ldots, 0) (lower-triangular Toeplitz), then ThT \statevec is the discrete convolution of the kernel (t0,t1,)(t_0, t_1, \ldots) with h\statevec. The FFT-based convolution algorithm computes ThT \statevec in O(nlogn)O(n \log n) time.

LTI SSM convolutional view (Chapter 8) is exactly this: the operator that maps the input sequence uu to the output sequence yy via y(t)=0th(ts)u(s)dsy(t) = \int_0^t h(t-s) u(s) ds is a Toeplitz matrix at the discretization level, with first column equal to the discretized impulse response h(0),h(Δ),h(2Δ),h(0), h(\stepsize), h(2\stepsize), \ldots.

Vandermonde matrices

A Vandermonde matrix has the form

V=(1x1x12x1n11x2x22x2n11xnxn2xnn1),V = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\ 1 & x_2 & x_2^2 & \cdots & x_2^{n-1} \\ \vdots & & & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^{n-1} \end{pmatrix},

parameterized by nn values (x1,,xn)(x_1, \ldots, x_n). The defining property is that (Vh)i=j=0n1vjxij(V \statevec)_i = \sum_{j=0}^{n-1} v_j x_i^j is a polynomial in xix_i evaluated at the node xix_i. So Vandermonde matrices implement polynomial evaluation at nn points as a linear map.

Vandermonde matrices are notoriously ill-conditioned for nodes on the real line (condition number can grow exponentially), but well-behaved for nodes on the unit circle — which is why the FFT (Vandermonde with xix_i being roots of unity) is numerically stable.

The S4 kernel computation uses Vandermonde structure: evaluating jcjλjk\sum_j c_j \lambda_j^k over k=0,1,,L1k = 0, 1, \ldots, L-1 is exactly the Vandermonde-style polynomial evaluation. The S4 paper Gu et al. (2022) uses Cauchy-matrix tricks (next subsection) to make this evaluation stable.

Cauchy matrices

A Cauchy matrix has entries

Cij=1xiyj,C_{ij} = \frac{1}{x_i - y_j},

parameterized by n+mn + m values (x1,,xn,y1,,ym)(x_1, \ldots, x_n, y_1, \ldots, y_m) with the {xi}{yj}\{x_i\} \cap \{y_j\} empty (to avoid zero denominators). Cauchy matrices are dense — every entry depends on both indices — but the very structured dependence enables fast algorithms: a Cauchy matrix-vector product can be computed in O(nlog2n)O(n \log^2 n) time using the fast multipole method.

Cauchy matrices appear in two places in the SSM literature: the S4 paper uses them as a numerically stable replacement for direct Vandermonde-style kernel evaluation, and the diagonal-plus-low-rank parametrization of A\statemat in S4 has a Cauchy-matrix interpretation when viewed through the partial-fraction decomposition of its transfer function.

Semiseparable matrices

A rank-kk semiseparable matrix has the property that every submatrix lying strictly above the main diagonal (and every submatrix lying strictly below) has rank at most kk. Equivalently, the upper and lower triangular parts each have rank-kk structure.

The 1-semiseparable case — every off-diagonal block has rank at most 1 — is the structure exploited by Mamba-2’s SSD framework Dao & Gu (2024) . The 1-semiseparable lower-triangular matrix corresponding to a scalar-times-identity SSM is, explicitly,

M=(1a11a1a2a21a1a2a3a2a3a31),M = \begin{pmatrix} 1 & & & \\ a_1 & 1 & & \\ a_1 a_2 & a_2 & 1 & \\ a_1 a_2 a_3 & a_2 a_3 & a_3 & 1 \\ \vdots & & & \ddots \end{pmatrix},

where aia_i is the (scalar) recurrence coefficient at step ii. The (i,j)(i, j) entry for i>ji > j is the product aj+1aj+2aia_{j+1} a_{j+2} \cdots a_i. The SSD insight is that this matrix-vector product can be computed two ways: as a scan (the SSM view, O(L)O(L) time) or as a structured matrix multiply (the attention view, O(L2)O(L^2) time but matmul-friendly on GPUs).

When LL is moderate (say 8192\le 8192) and the GPU favors matmul over scan, the matrix view wins; when LL is huge, the scan view wins. Mamba-2’s chunked algorithm interpolates: matrix-multiplies within chunks of size 256\sim 256, scans across chunks. This is the structural reason Mamba-2 is faster than Mamba-1 on long sequences without sacrificing parallelism.

Heatmap visualizations of Toeplitz, Vandermonde, Cauchy, and semiseparable matrices showing their structural patterns.
Structure of the four matrix families for N=8: Toeplitz (constant along diagonals — convolution); Vandermonde (powers of node values in rows); Cauchy (each entry is 1/(x_i - y_j)); 1-semiseparable lower triangular (each lower-triangular entry is the product of a prefix of off-diagonal factors). Color encodes log|entry|. Produced by companions/ch03/jax/structured_matrices.py.

3.5 Low-rank corrections and rank-1 updates

A recurring pattern: take a structured matrix SS (diagonal, banded, or semiseparable) and add a small low-rank correction UVU V^\top where U,VRn×kU, V \in \R^{n \times k} with knk \ll n. The resulting matrix S+UVS + U V^\top retains nearly the storage and matvec efficiency of SS, and admits a closed-form inverse via the Sherman–Morrison–Woodbury identity:

(S+UV)1=S1S1U(I+VS1U)1VS1.(S + U V^\top)^{-1} = S^{-1} - S^{-1} U (I + V^\top S^{-1} U)^{-1} V^\top S^{-1}.

The cost is dominated by inverting the k×kk \times k matrix I+VS1UI + V^\top S^{-1} U, not the n×nn \times n outer matrix.

This pattern appears in S4 explicitly: the HiPPO-LegS matrix isn’t diagonal, but it is “diagonal plus low-rank” — specifically, normal plus low-rank, which is what makes the kernel computation tractable. The S4 paper’s main algorithmic contribution is a specialized Sherman–Morrison-style trick for this exact decomposition.

The pattern recurs even more visibly in DeltaNet (Chapter 12), where the state update is

St+1=St+βt(vtStkt)kt=St(Iβtktkt)+βtvtkt.S_{t+1} = S_t + \beta_t (v_t - S_t k_t) k_t^\top = S_t (I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top.

The middle factor is a rank-1 correction (specifically, an identity minus a rank-1 outer product), and the whole expression is one explicit-Euler step of an online gradient-descent update on the per-token association loss. The Longhorn paper Dao & Gu (2024) (well, the linear-attention lineage more generally) frames this as a discretization of an implicit-rank-1-correction online ODE.

The key takeaway: structured + low-rank corrections give you efficient matrix-vector products without giving up expressiveness. This is the design pattern unifying S4, DeltaNet, GLA, and the Mamba-2 SSD framework — each is a different choice of base structure and correction pattern.

3.6 Krylov subspaces: a primer

The Krylov subspace of order kk generated by a matrix A\statemat and a vector bb is

Kk(A,b):=span{b,Ab,A2b,,Ak1b}.\mathcal{K}_k(\statemat, b) := \text{span}\{ b, \statemat b, \statemat^2 b, \ldots, \statemat^{k-1} b \}.

Krylov subspaces are the workhorse of iterative methods for large sparse linear systems: GMRES, conjugate gradient, Arnoldi iteration, Lanczos. All of them construct an orthonormal basis for Kk(A,b)\mathcal{K}_k(\statemat, b) and solve a small (k×kk \times k) projected problem instead of the original n×nn \times n one.

For SSMs, the Krylov picture is conceptual rather than algorithmic. The point is that Kk(A,b)\mathcal{K}_k(\statemat, b) contains all the information the recurrence ht+1=Aht+But\statevec_{t+1} = \statemat \statevec_t + \inputmat u_t can extract from the initial condition bb in kk steps. If the system has nkn - k eigenvalues that are decoupled from the initial direction bb (the so-called “unreachable subspace”), the recurrence cannot recover them, and the effective dimension of the SSM is k<nk < n. This is one rigorous reading of the Mamba copying-limitation results — the recurrence’s expressive ceiling is set by the Krylov dimension of A\statemat relative to the input.

A full treatment of Krylov methods would fill its own chapter; the curriculum revisits the picture in Chapter 8 (where the S4 kernel can be viewed as a structured Krylov-projection problem) and in Chapter 16’s empirical-methodology discussion of why some architectures can copy strings exponentially longer than others.

For a textbook coverage of Krylov-subspace methods, Trefethen–Bau Trefethen & Bau (1997) Chapters 32–40 give the full algorithmic treatment.

3.7 What’s next

You now have the linear-algebra vocabulary the rest of the book assumes. Chapter 4 picks up the discretization thread (started in Chapter 2, §2.4) and develops it systematically: order conditions, accuracy classes, the Butcher tableau, the bilinear and ZOH derivations in detail. Chapter 7 introduces the HiPPO theory that connects orthogonal-basis approximation theory to the A\statemat matrix structure of S4. Chapter 8 then shows how all four structured-matrix families combine in the S4 / S4D / S5 family.

If you’re impatient, Chapter 9’s Mamba-1/2/3 presentation is the payoff for the SSD discussion of §3.4 — the selective scan’s matmul-friendly chunkwise algorithm is exactly the 1-semiseparable matrix product algorithm.

3.8 Exercises

Six problems. Inline solutions for the shorter ones; full proofs for the theory exercises in §3.9.

Exercise 3.1 (computation)

Compute the Jordan normal form of A=(2102)\statemat = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}.

Solution

The matrix is already in Jordan form: a single 2×22 \times 2 Jordan block with eigenvalue λ=2\lambda = 2. There is one eigenvalue (algebraic multiplicity 2) but only one linearly independent eigenvector (geometric multiplicity 1), giving a defective J2(2)J_2(2). The transformation matrix VV is the identity (J=AJ = \statemat directly).

Exercise 3.2 (computation)

Compute the singular values of A=(300400)\statemat = \begin{pmatrix} 3 & 0 \\ 0 & 4 \\ 0 & 0 \end{pmatrix} and its operator norm.

Solution

The matrix is already in SVD form (with U=I3U = I_3 and V=I2V = I_2). Singular values are σ1=4\sigma_1 = 4, σ2=3\sigma_2 = 3. Operator norm: A2=σ1=4\norm{\statemat}_2 = \sigma_1 = 4. (Note that swapping the diagonal entries doesn’t change the SVD; singular values are always sorted descending.)

Exercise 3.3 (computation)

For the Vandermonde matrix with nodes (x1,x2,x3)=(1,2,3)(x_1, x_2, x_3) = (1, 2, 3), write out the matrix and compute its determinant. Compare to the closed-form Vandermonde determinant detV=i<j(xjxi)\det V = \prod_{i < j} (x_j - x_i).

Solution

V=(111124139).V = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{pmatrix}.

Direct computation: detV=1(2943)1(1941)+1(1321)=65+1=2\det V = 1 \cdot (2 \cdot 9 - 4 \cdot 3) - 1 \cdot (1 \cdot 9 - 4 \cdot 1) + 1 \cdot (1 \cdot 3 - 2 \cdot 1) = 6 - 5 + 1 = 2.

Closed form: (21)(31)(32)=121=2(2-1)(3-1)(3-2) = 1 \cdot 2 \cdot 1 = 2. ✓

Exercise 3.4 (computation)

Verify the Sherman–Morrison identity numerically: pick a random invertible 3×33 \times 3 matrix SS, vectors u,vR3u, v \in \R^3, and check that (S+uv)1(S + u v^\top)^{-1} matches the closed-form expression to machine precision.

Solution
import numpy as np
rng = np.random.default_rng(0)
S = rng.standard_normal((3, 3))
u = rng.standard_normal(3)
v = rng.standard_normal(3)
direct = np.linalg.inv(S + np.outer(u, v))
S_inv = np.linalg.inv(S)
factor = 1.0 + v @ S_inv @ u
sherman = S_inv - np.outer(S_inv @ u, v @ S_inv) / factor
print(np.allclose(direct, sherman))  # True

The factor 1+vS1u1 + v^\top S^{-1} u must be non-zero (the Sherman–Morrison identity fails when it is, indicating that S+uvS + u v^\top is singular).

Exercise 3.5 (theory) — solution in §3.9

Prove the Eckart–Young theorem: the best rank-kk approximation to a matrix A\statemat in the Frobenius norm is obtained by truncating its SVD at rank kk. That is, if A=UΣV\statemat = U \Sigma V^\top with singular values σ1σn\sigma_1 \ge \cdots \ge \sigma_n, then Ak:=ikσiuivi\statemat_k := \sum_{i \le k} \sigma_i u_i v_i^\top minimizes ABF\norm{\statemat - B}_F over all matrices BB of rank k\le k.

Exercise 3.6 (theory) — solution in §3.9

Prove that a Toeplitz matrix TRn×nT \in \R^{n \times n} with Th2Ch2\norm{T \statevec}_2 \le C \norm{\statevec}_2 for all h\statevec (operator-norm-bounded TT) admits a matrix-vector product in O(nlogn)O(n \log n) time via the FFT. Specifically, show that any n×nn \times n Toeplitz TT can be embedded in a 2n×2n2n \times 2n circulant matrix, whose matvec is exactly the FFT-IFFT pair applied to the kernel and input.

3.9 Full solutions to theory exercises

Solution to Exercise 3.5

The proof uses two ingredients: (a) the SVD’s unitary invariance of the Frobenius norm, and (b) the optimality of truncated diagonal matrices.

Setup. Let A=UΣV\statemat = U \Sigma V^\top be the SVD with Σ=diag(σ1,,σn)\Sigma = \diag(\sigma_1, \ldots, \sigma_n) and σ1σn0\sigma_1 \ge \cdots \ge \sigma_n \ge 0. For any BB of rank k\le k, write B=UB~VB = U \widetilde B V^\top where B~:=UBV\widetilde B := U^\top B V also has rank k\le k (rank is invariant under invertible transformations). The Frobenius norm is invariant under orthogonal transformations:

ABF=U(ΣB~)VF=ΣB~F.\norm{\statemat - B}_F = \norm{U(\Sigma - \widetilde B) V^\top}_F = \norm{\Sigma - \widetilde B}_F.

So the problem reduces to: minimize ΣB~F\norm{\Sigma - \widetilde B}_F over rank-k\le k matrices B~\widetilde B.

Diagonal reduction. Write B~\widetilde B with its own SVD B~=XDY\widetilde B = X D Y^\top where D=diag(d1,,dk,0,,0)D = \diag(d_1, \ldots, d_k, 0, \ldots, 0). Then

ΣB~F2=tr((ΣB~)(ΣB~))=ΣF22tr(ΣB~)+B~F2.\norm{\Sigma - \widetilde B}_F^2 = \tr((\Sigma - \widetilde B)^\top (\Sigma - \widetilde B)) = \norm{\Sigma}_F^2 - 2 \tr(\Sigma \widetilde B^\top) + \norm{\widetilde B}_F^2.

The middle trace term tr(ΣB~)ikσidi\tr(\Sigma \widetilde B^\top) \le \sum_{i \le k} \sigma_i d_i by the von-Neumann trace inequality (with equality when B~\widetilde B‘s singular vectors align with Σ\Sigma‘s — i.e. B~\widetilde B is diagonal in the same basis as Σ\Sigma). Optimizing did_i to maximize this term subject to rank k\le k gives di=σid_i = \sigma_i for iki \le k and di=0d_i = 0 for i>ki > k, i.e. B~=diag(σ1,,σk,0,,0)\widetilde B = \diag(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0).

Substituting back, the minimum value is ΣB~F2=i>kσi2\norm{\Sigma - \widetilde B^*}_F^2 = \sum_{i > k} \sigma_i^2, achieved by the truncated SVD Ak=Udiag(σ1,,σk,0,)V=ikσiuivi\statemat_k = U \diag(\sigma_1, \ldots, \sigma_k, 0, \ldots) V^\top = \sum_{i \le k} \sigma_i u_i v_i^\top. ∎

(The proof for the operator norm follows the same structure but uses Weyl’s interlacing inequality instead of von-Neumann’s trace inequality; see Golub–Van Loan Golub & Van Loan (2013) for the detailed argument.)

Solution to Exercise 3.6

Let TRn×nT \in \R^{n \times n} be Toeplitz with entries Tij=tijT_{ij} = t_{i-j} for some kernel (t(n1),,tn1)(t_{-(n-1)}, \ldots, t_{n-1}).

Step 1 — Circulant embedding. Define a circulant matrix CR2n×2nC \in \R^{2n \times 2n} with first column

c=(t0,t1,t2,,tn1,0,t(n1),t(n2),,t1).c = (t_0, t_1, t_2, \ldots, t_{n-1}, 0, t_{-(n-1)}, t_{-(n-2)}, \ldots, t_{-1})^\top.

The first nn rows and first nn columns of CC exactly reproduce TT (the zero in the middle of cc is the “buffer” that prevents wrap-around contamination).

Step 2 — Padded matvec. Given hRn\statevec \in \R^n, form h~=(h,0)R2n\widetilde \statevec = (\statevec, 0)^\top \in \R^{2n} (zero-padded). Then (Ch~)1:n=Th(C \widetilde \statevec)_{1:n} = T \statevec: the first nn entries of the circulant product are exactly the Toeplitz product, because the zero-padding ensures the wrap-around portion of the convolution doesn’t contaminate the top half.

Step 3 — Circulant matvec via FFT. Every 2n×2n2n \times 2n circulant matrix CC is diagonalized by the discrete Fourier transform: C=F1diag(Fc)FC = F^{-1} \diag(F c) F, where FF is the 2n2n-point DFT matrix. So

Ch~=F1(diag(Fc)Fh~)=F1((Fc)(Fh~)),C \widetilde \statevec = F^{-1} \, (\diag(F c) \cdot F \widetilde \statevec) = F^{-1} \, ((F c) \odot (F \widetilde \statevec)),

where \odot is element-wise product. Both FFTs and the IFFT take O(nlogn)O(n \log n) time; the element-wise product is O(n)O(n).

Total cost. O(nlogn)O(n \log n) for the FFTs + O(n)O(n) for the element-wise multiply + O(n)O(n) to extract the first nn entries = O(nlogn)O(n \log n). ∎

This is the exact algorithm used to compute LTI SSM convolutions in S4-family implementations: precompute the kernel h(0),h(Δ),,h((L1)Δ)h(0), h(\stepsize), \ldots, h((L-1)\stepsize) once, then convolve with any input in O(LlogL)O(L \log L) time.

3.10 Companion code

Two runnable companions in companions/ch03/jax/:

  • condition_number.py — plots κ(A) growth for random Gaussian, Hilbert, and HiPPO-LegS matrices as N grows; produces Figure 3.1
  • structured_matrices.py — constructs Toeplitz / Vandermonde / Cauchy / 1-semiseparable matrices for N=8 and visualizes structural patterns; produces Figure 3.2

To run from the repo root:

PYTHONPATH=. python companions/ch03/jax/condition_number.py
PYTHONPATH=. python companions/ch03/jax/structured_matrices.py

Figures land in public/figures/ch03/.