Lecture 1
Introduction to Inverse Theory
What is Inverse Theory?
Inverse theory is a set of mathematical techniques used to infer the properties of a physical system from observations of its output. It is a fundamental tool in many scientific disciplines, including geophysics, seismology, and medical imaging. Inverse theory is used to solve a wide range of problems, such as:
- Parameter Estimation: Determining the values of unknown parameters in a model that best fit the observed data.
- System Identification: Identifying the structure and dynamics of a system from input-output data.
- Image Reconstruction: Reconstructing an image or object from noisy or incomplete measurements.
What many of these tasks have in common is that we are working with incomplete information. There is a forward problem that has generated the data we observe, \(\vec{b}\), from a set of input parameters \(\vec{x}\), and we want to solve the inverse problem of recovering those parameters from the data. However, the inverse problem is often ill-posed, meaning that there are multiple solutions that fit the data equally well. Inverse theory provides a framework for finding the best solution to these problems.
The forward problem can be described, for example, by a differential equation or operator \(L\) that acts on a field \(u\) and depends on model parameters \(x\):
\[ L(x)[u] = q \iff u = L^{-1}(x)[q] \]
For example, when making measurements of an electric potential field that depends on subsurface conductivity values, we have:
\[ \nabla \cdot ( \sigma \nabla u ) = q + \text{BC}\]
We measure \(u\) at some points and use those measurements to form an estimate of the conductivity \(\sigma\). The forward problem is to solve for \(u\) given \(\sigma\), and the inverse problem is to solve for \(\sigma\) given \(u\). The forward problem is often well-posed, while the inverse problem is often ill-posed.
For a computational framework, we can discretize the equation so that the operator is a matrix \(A\) and the data is a vector \(\vec{b}\):
\[ \underbrace{A}_{\text{Forward Map}} \underbrace{\vec{x}}_{\text{Model Parameters}} + \epsilon = \underbrace{\vec{b}}_{\text{Observed Data}} \]
In this case we may have a sparse set of measurements \(\vec{b}\) and a large set of unknowns \(\vec{x}\), making the problem underdetermined. The goal of inverse theory is to find the best estimate of \(\vec{x}\) given \(\vec{b}\).
Example: The Triathlon Problem
To illustrate the concept of inverse theory, consider the following example:
Suppose that you have agreed to meet a friend to watch them during a triathlon race, but you showed up late and missed the start. They expect you to have been there at the moment they changed from the running phase to the cycling phase, and to know the time at which they made the transition. However, you only know the overall start time and finish time of the race.
If the race starts at time \(t=0\) and ends at time \(t=b\), how do you use this information to deduce the actual time \(t_r \in [0,b]\) at which they crossed the transition zone of the race?
The first restriction on feasible solutions is the domain \([0,b]\), so we know that \(0 < t_r < b\).
Beyond this, there are other techniques we could use to better inform the probability of the transition occurring at different times. For example, we might have a good idea of their fitness level or average running speed from previous experience. In the absence of this information, there might be average times for the competitors available to further inform the problem and reduce the error in the estimate.
The Singular Value Decomposition
For cases where the matrix \(A\) is not full rank, the singular value decomposition (SVD) provides a more general framework for solving the least squares problem. The SVD decomposes the matrix \(A\) into a product of three matrices \(U\), \(\Sigma\), and \(V\):
\[ A = U \Sigma V^T \]
The matrices have the following special properties:
- Orthogonal Subspaces: \(U\) and \(V\) are orthogonal matrices, meaning that \(U^TU = I\) and \(V^TV = I\); that is, \(U^T = U^{-1}\) and \(V^T = V^{-1}\).
- Ordered Singular Values: \(\Sigma\) is a diagonal matrix with non-negative values on the diagonal, known as the singular values of \(A\). The singular values are ordered such that \(\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r > 0\). The number of non-zero singular values is equal to the rank of \(A\).
Suppose that we have a matrix \(A\) with \(\text{rank}(A) = r\) which maps from \(\mathbb{R}^m \rightarrow \mathbb{R}^n\). A fundamental way to view this mapping is as a composition of three linear transformations: a rotation \(V^T\), a scaling \(\Sigma\), and another rotation \(U\). The orthogonal matrix \(V\) has the property that all of its rows and columns are orthogonal to each other, and the vectors themselves are normalized to length \(1\). To see this property of an orthogonal matrix, consider that \(V^T V = I\) and \(V V^T = I\):
\[ \begin{align} Z = V^T V &= I \\ z_{ij} = \langle v_i, v_j \rangle &= \delta_{ij} \end{align} \]
Each element \(z_{ij}\) of the matrix \(V^T V\) is the dot product of the \(i\)th and \(j\)th columns of \(V\). The dot product of any vector with itself is \(1\), and the dot product of any two different columns is \(0\). From this we can see that all of the columns of \(V\) are orthonormal. The same property holds for \(U\).
By our definition of \(A\), the matrix \(V^T\) must accept a vector from \(\mathbb{R}^m\), and since it is square, it is an \(m \times m\) matrix. The matrix \(U\) must output a vector in \(\mathbb{R}^n\), and since it is square, it is an \(n \times n\) matrix. The matrix \(\Sigma\) must then be \(n \times m\) to map from \(\mathbb{R}^m\) to \(\mathbb{R}^n\).
In all its glory:
\[ \begin{aligned} A_{n \times m} &= U_{n \times n} \, \Sigma_{n \times m} \, V^T_{m \times m} \\ &= \left[ \begin{array}{ccc|ccc} \mathbf{u}_1 & \cdots & \mathbf{u}_r & \mathbf{u}_{r+1} & \cdots & \mathbf{u}_n \end{array} \right]_{n \times n} \left[ \begin{array}{ccc} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \\ 0 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 0 \end{array} \right]_{n \times m} \left[ \begin{array}{c} \mathbf{v}^T_1 \\ \vdots \\ \mathbf{v}^T_r \\ \hline \mathbf{v}^T_{r+1} \\ \vdots \\ \mathbf{v}^T_m \end{array} \right]_{m \times m} \end{aligned} \]
In this case the first \(r\) columns of \(U\) span the range (column space) of \(A\), while the rest of \(U\) spans its orthogonal complement, the null space of \(A^T\). The first \(r\) columns of \(V\) span the row space of \(A\), while the rest of \(V\) spans its orthogonal complement, the null space of \(A\). These are the four fundamental subspaces of the matrix \(A\); more information can be found at Wikipedia: SVD.
The matrices as shown above are for a rectangular \(A\) where \(n > m\), but the same properties hold for all \(n, m\). Some of the singular values \(\sigma_i\) may be zero, in which case the matrix \(A\) is not full rank.
Another way to decompose the SVD is to write it as a sum of outer products that are scaled by the diagonal matrix of singular values:
\[ A = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^T \]
If \(\sigma_i > 0\) then \(v_i\) is not in the null space of \(A\) because \(A v_i = \sigma_i u_i\). If \(\sigma_i = 0\) then \(v_i\) is in the null space of \(A\) because \(A v_i = 0\).
The Pseudoinverse
Returning to the task of inverting \(Ax + \epsilon = b\), we can apply the SVD:
\[\begin{align} U \Sigma V^T x + \epsilon &= b \\ \Sigma V^T x &= U^T (b-\epsilon) \\ V \Sigma^{+} U^T (b-\epsilon) &= \hat{x} \\ A^+ (b-\epsilon) &= \hat{x} \end{align}\]
where \(A^+ = V \Sigma^{+} U^T\) is the pseudoinverse of \(A\). The pseudoinverse is a generalization of the matrix inverse for non-square or rank-deficient matrices. We recover an invertible matrix by removing all of the absent or zero singular values from \(\Sigma\) and inverting the rest, giving an \(r \times r\) diagonal matrix whose inverse is simply the reciprocal of each diagonal element.
\[ \left[ \begin{array}{ccc} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \\ 0 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 0 \end{array} \right]_{n \times m} \rightarrow \left[ \begin{array}{ccc} \sigma_1^{-1} & & \\ & \ddots & \\ & & \sigma_r^{-1} \\ \end{array} \right]_{r \times r}\]
Then \[\hat{x} = \sum_{i=1}^{N} \sigma_i^{-1} \, \mathbf{u}_i^T (b-\epsilon) \, \mathbf{v}_i\] is the solution to the least squares problem, where \(N = r\) recovers the full pseudoinverse. The sum can also be truncated by choosing \(N < r\). In practice, with real-world measurements, we end up with many singular values that are effectively \(0\) by virtue of being very small relative to the noise in the data and the largest singular value. The solution \(\hat{x}\) is a sum of \(v_i\) components that form an orthogonal basis, \(\hat{x} = \sum_i \beta_i v_i\), where \(\beta_i = \frac{u_i^T (b-\epsilon)}{\sigma_i}\). The small singular values blow up in size when inverted, so truncation is often necessary to avoid numerical instability and excessive amplification of the noise \(\epsilon\).
Least Squares
Least squares and matrix inversion are a classic starting point for understanding inverse theory. Suppose that we have model parameters \(\vec{x}\) and observed data \(\vec{b}\) that are related by a linear system of equations: \[Ax = b\] where \(A\) is a matrix of coefficients. In many cases, the system is overdetermined, meaning that there are more equations than unknowns. In this case, there is generally no exact solution to the system, and we must find the best solution that minimizes the error between the observed data \(\vec{b}\) and the predicted data \(A\vec{x}\). The simplest form of inversion that we can attempt is the least squares solution. In this case we discard the components of the observed data that lie outside the range of \(A\) — that is, in the null space of \(A^T\) — since no choice of parameters can reproduce them.
Example
Let \(A\) be a \(3 \times 2\) matrix and \(\vec{b}\) be a \(3 \times 1\) vector. The \(\vec{x}\) that we are trying to solve for is a \(2 \times 1\) vector. The system of equations is given by:
\[ A = \begin{bmatrix} \vec{a}_1 & \vec{a}_2 \end{bmatrix} \quad \vec{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \vec{b} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} \]
In this case we have an overdetermined system with three equations, two unknowns, and three data samples. If the system of equations is full rank then we are trying to map from a 2D space to a 3D space: \(A: \mathbb{R}^2 \rightarrow \mathbb{R}^3\). In this case there is no exact solution to the system for any \(b\) that is not in the column space of \(A\).
Instead we can solve for the least squares solution \(\vec{x}_{LS}\) by minimizing the error between the observed data \(\vec{b}\) and the predicted data \(A\vec{x}\) from the forward model.
\[ \vec{x}_{LS} = \arg \min_{\vec{x}} ||A\vec{x} - \vec{b}||_2^2 \]
We want to find the argument that minimizes the function \(f(\vec{x}) = ||A\vec{x} - \vec{b}||_2^2\). By first order optimality conditions, the gradient of the function must be zero at the minimum.
\[ \begin{align} \nabla f(\vec{x}) &= 0 \\ \nabla ||A\vec{x} - \vec{b}||_2^2 &= 0 \\ \nabla (A\vec{x} - \vec{b})^T (A\vec{x} - \vec{b}) &= 0 \\ \nabla \left( \vec{x}^T A^T A \vec{x} - 2 \vec{b}^T A \vec{x} + \vec{b}^T \vec{b} \right) &= 0 \\ 2 A^T A \vec{x} - 2 A^T \vec{b} &= 0 \\ A^T A \vec{x} &= A^T \vec{b} \\ \vec{x}_{LS} &= (A^T A)^{-1} A^T \vec{b} \end{align} \]
These are known as the normal equations for the least squares solution. We note with caution that \(A^T A\) must be invertible for this solution to exist. If \(A\) is not full rank, then the matrix \(A^T A\) will not be invertible and other methods must be used.
We call the difference between the observed data and the predicted data the residual:
\[r = \vec{b} - A\vec{x}_{LS}\]
In these terms, what we are really minimizing is the sum of the squares of the residuals, \(||r||_2^2\) — the sum of the squared misfits in the data.
There is an altogether informative way to think about the minimization problem purely in terms of linear algebra and subspaces to derive the same normal equations.
Least Squares Visual
The range (or image) of \(A\) is the subspace of \(\mathbb{R}^3\) spanned by the columns of \(A\). This subspace has dimension \(2\) because there are only two columns in \(A\), with \(R(A) \subset \mathbb{R}^3\). The inaccessible parts of \(\mathbb{R}^3\) lie in the orthogonal complement of \(R(A)\), denoted \(R(A)^\perp\). Recalling that \(R(A)^\perp = N(A^T)\), we can diagram the least squares solution as the choice of \(x\) that leaves the error vector \(r\) entirely in the orthogonal complement of \(R(A)\).
As seen in the figure, the residual \(r\) is perpendicular to \(R(A)\), and in particular to the prediction \(A x_{LS}\): the projection of \(r\) onto \(R(A)\) is zero. Since \(r\) lies in the null space of \(A^T\), we have \(A^T r = 0\).
\[ \begin{align} A^T \left ( Ax_{LS} - b \right ) &= 0\\ A^T A x_{LS} &= A^T b \\ \end {align} \]
So we recover the normal equations without using any of the machinery of calculus.
For a review on the four fundamental subspaces of a matrix see the UBC Math 307 notes on the topic: Math 307