Lecture 4

Regularization and the Conjugate Gradient Methods

Optimization
Inverse Theory
Python
Tikhonov regularization is a common technique used in inverse theory to stabilize ill-posed problems. In this lecture, we derive the Tikhonov regularization technique and also look at a least squares solution that does not require computing the full SVD of the matrix \(A\), using the conjugate gradient method.
Author

Simon Ghyselincks

Published

September 20, 2024

Tikhonov Regularization

We have looked at the least squares formulation for solving inverse problems:

\[ \min \frac{1}{2} \|A x - b\|^2 \]

where \(A \in \mathbb R^{m \times n}\) is a linear operator, \(x \in \mathbb R^n\) is the unknown model, and \(b \in \mathbb R^m\) is the data.

The least squares problem is often ill-posed, meaning that the solution is not unique or not stable. If there are more unknowns than equations, as is the case when \(n > m\), then the problem is underdetermined and there are infinitely many solutions.

We can recover a unique solution by adding a regularization term that determines which \(x\) we prefer among the minimizers. The Tikhonov regularization technique adds a penalty term to the least squares problem:

\[ \min \frac{1}{2} \|A x - b\|^2 + \frac{1}{2} \lambda \|Lx\|^2 \]

where \(L \in \mathbb R^{n \times n}\) is a regularization matrix. The regularization matrix \(L\) is often chosen to be the identity matrix, but other choices are possible.

Uniqueness

To check the uniqueness of the solution, we can rewrite the problem as a quadratic form:

\[ \begin{align} \min_x\; & \frac{1}{2} x^T A^T A x - b^T A x + \frac{1}{2}\|b\|^2 + \frac{1}{2} \lambda x^T L^T L x \\ = \min_x\; & \frac{1}{2} x^T H x - b^T A x + \frac{1}{2}\|b\|^2 \end{align} \]

where \(H = A^T A + \lambda L^T L\) is the Hessian matrix, which is symmetric and positive semi-definite since \(x^T H x = \|Ax\|^2 + \lambda \|Lx\|^2 \geq 0\). If \(\lambda > 0\) and the null spaces of \(A\) and \(L\) intersect only at the zero vector, then the Hessian is positive definite and the problem is well-posed. In the case where \(L=I\), the Hessian becomes full rank for any \(\lambda > 0\). The property that \(H \succ 0\) means that the matrix is invertible and the minimizer is unique.
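
As a quick numerical illustration (a sketch of my own, not from the lecture, with \(L = I\)), we can check that adding \(\lambda I\) lifts the smallest eigenvalue of a rank-deficient \(A^T A\) away from zero:

import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 24
A = rng.standard_normal((m, n))    # underdetermined: A^T A is rank-deficient
lam = 1e-2

eig_noreg = np.linalg.eigvalsh(A.T @ A)                   # smallest eigenvalues are numerically zero
eig_reg = np.linalg.eigvalsh(A.T @ A + lam * np.eye(n))   # smallest eigenvalue is lifted to ~lambda

print(eig_noreg.min())   # approximately 0: A^T A alone is singular
print(eig_reg.min())     # approximately lam: H is positive definite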

Solution

The unique solution is given by the first-order optimality condition:

\[ \begin{align} (A^T A + \lambda L^T L) \mathbf{x}_{\text{RLS}} - A^T b&= 0 \\ \mathbf{x}_{\text{RLS}} &= (A^T A + \lambda L^T L)^{-1} A^T b \end{align} \]
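
As a minimal sketch (my own illustration, using the identity regularizer), the closed-form solution can be computed with a linear solve rather than an explicit inverse:

import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 24
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam = 1e-2
L = np.eye(n)                        # identity regularizer

# x_RLS = (A^T A + lambda L^T L)^{-1} A^T b, computed via np.linalg.solve
H = A.T @ A + lam * (L.T @ L)
x_rls = np.linalg.solve(H, A.T @ b)
print(x_rls.shape)                   # (24,)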

SVD Decomposition

Assuming \(L = I\), and writing \(I = V V^T\) for the orthogonal matrix \(V\), the solution can be written in terms of the singular value decomposition of \(A\):

\[ \begin{align} A &= U \Sigma V^T \\ A^T A &= V \Sigma^T \Sigma V^T \\ \mathbf{x}_{\text{RLS}} &= \left( V \Sigma^2 V^T + \lambda I \right)^{-1} V \Sigma^T U^T b \\ &= \left( V \Sigma^2 V^T + \lambda V V^T \right)^{-1} V \Sigma^T U^T b\\ &= V \left( \Sigma^2 + \lambda I \right)^{-1} \Sigma^T U^T b\\ &= V \,\mathbf{Diag}\left( \frac{\sigma_i}{\sigma_i^2 + \lambda} \right) U^T b\\ &= \sum_{i=1}^{n} \frac{\sigma_i}{\sigma_i^2 + \lambda} \langle u_i, b \rangle v_i \end{align} \]
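
The filter-factor form above can be checked numerically against the direct solve of the normal equations (again a sketch of my own, with \(L = I\) assumed):

import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 24
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam = 1e-2

# Direct solve of the regularized normal equations
x_direct = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# SVD filter-factor form: sum_i sigma_i / (sigma_i^2 + lambda) <u_i, b> v_i
U, S, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((S / (S**2 + lam)) * (U.T @ b))

print(np.allclose(x_direct, x_svd))  # True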

This form is more readily comparable to some of the other methods that we have seen so far, which are presented in the table below:

Comparison of Least Squares Methods

Method            Solution
Tikhonov          \(x_{\text{RLS}} = \sum_{i=1}^n \frac{\sigma_i}{\sigma_i^2 + \lambda} \langle u_i, b \rangle v_i\)
Thresholded SVD   \(x_{\text{TSVD}} = \sum_{i=1}^n h(\sigma_i) \langle u_i, b \rangle v_i\)
Gradient Flow     \(x_{\text{SDF}} = \sum_{i=1}^n \frac{1 - \exp(-\sigma_i^2 t)}{\sigma_i} \langle u_i, b \rangle v_i\)

where \(h(\sigma_i) = 1/\sigma_i\) if \(\sigma_i\) is above the chosen threshold \(\tau\) and \(0\) otherwise.

As we can see, all three methods have a similar form and offer some mechanism for controlling the noise amplification induced by the small singular values of \(A\).

Show the code
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def generate_ill_conditioned_matrix(m, n, condition_number):   
    # Generate random orthogonal matrices U and V
    U, _ = np.linalg.qr(np.random.randn(m, m))
    V, _ = np.linalg.qr(np.random.randn(n, n))
    
    sigma = np.linspace(1, 1/condition_number, min(m, n))    
    Sigma = np.diag(sigma)    
    A = U @ Sigma @ V[:min(m, n), :]
    
    return A, sigma

# Seed for reproducibility
np.random.seed(4)
A, S = generate_ill_conditioned_matrix(8, 24, 1e3)

# Create a data vector b of size m = 8 with random values
b = np.random.randn(8)

# Compute the SVD of A
U, S, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# Number of singular values
n = len(S)

# Define parameters for each method
# Gradient Flow
t_values = np.linspace(0, 0.6, 100)

# Tikhonov Regularization
lambda_values = np.linspace(1e-4, 1, 100)

# Thresholded SVD
threshold_values = np.linspace(0, max(S), 100)

# Compute scaling factors for each method
# Gradient Flow Scaling
def gradient_flow_scaling(sigma, t):
    return (1 - np.exp(-sigma**2 * t)) / sigma

gradient_scalings = np.array([gradient_flow_scaling(s, t_values) for s in S])

# Tikhonov Scaling
def tikhonov_scaling(sigma, lambd):
    return sigma / (sigma**2 + lambd)

tikhonov_scalings = np.array([tikhonov_scaling(s, lambda_values) for s in S])

# Thresholded SVD Scaling
def tsvd_scaling(sigma, threshold):
    return np.where(sigma >= threshold, 1/sigma, 0)

tsvd_scalings = np.array([tsvd_scaling(s, threshold_values) for s in S])

# Initialize the plot with 3 subplots
fig, axes = plt.subplots(3, 1, figsize=(5, 15))

# Define a color palette
palette = sns.color_palette("husl", n)

# Plot Gradient Flow
ax = axes[0]
for i in range(n):
    ax.plot(t_values, gradient_scalings[i], color=palette[i], linewidth=2, label=rf'$1/\sigma_{i}$')
    ax.axhline(y=1/S[i], color=palette[i], linestyle='--', linewidth=1)
ax.set_yscale('log')
ax.set_xlabel('Time (t)', fontsize=14)
ax.set_ylabel('Scaling Factor', fontsize=14)
ax.set_title('Gradient Flow', fontsize=16)
ax.legend()
ax.grid(True)

# Plot Tikhonov Regularization
ax = axes[1]
for i in range(n):
    ax.plot(lambda_values, tikhonov_scalings[i], color=palette[i], linewidth=2, label=rf'$1/\sigma_{i}$')
    ax.axhline(y=1/S[i], color=palette[i], linestyle='--', linewidth=1)
ax.set_yscale('log')
ax.set_xlabel('Regularization Parameter (λ)', fontsize=14)
ax.set_ylabel('Scaling Factor', fontsize=14)
ax.set_title('Tikhonov Regularization', fontsize=16)
ax.legend()
ax.grid(True)

# Plot Thresholded SVD
ax = axes[2]
for i in range(n):
    ax.plot(threshold_values, tsvd_scalings[i], color=palette[i], linewidth=2, label=rf'$1/\sigma_{i}$')
    ax.axhline(y=1/S[i], color=palette[i], linestyle='--', linewidth=1)
ax.set_yscale('log')
ax.set_xlabel('Threshold (τ)', fontsize=14)
ax.set_ylabel('Scaling Factor', fontsize=14)
ax.set_title('Thresholded SVD', fontsize=16)
ax.legend()
ax.grid(True)

# Adjust layout and add a legend
plt.tight_layout()
plt.show()

Evolution of scaling factors for three different methods

Solving Least Squares Using Conjugate Gradient

A detailed explanation of this method can be found on Wikipedia.

Conjugate Vectors Definition

A set of vectors \(\{ p_1, p_2, \ldots, p_n \}\) is said to be conjugate (or \(A\)-orthogonal) with respect to a symmetric positive definite matrix \(A\) if:

\[ \langle p_i, A p_j \rangle = 0 \quad \text{for all } i \neq j \]

This is a generalization of the concept of orthogonality to the inner product weighted by \(A\).

Standard Orthogonality: When \(A = I\) (the identity matrix), the definition reduces to the standard concept of orthogonality. For a symmetric \(A\), the spectral theorem also gives an orthogonal decomposition into eigenvectors, and these eigenvectors are conjugate with respect to \(A\).
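
As a quick check (my own example, not from the lecture), the eigenvectors of a symmetric positive definite matrix are indeed mutually conjugate with respect to that matrix:

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)              # symmetric positive definite test matrix

_, Q = np.linalg.eigh(A)                 # columns of Q are orthonormal eigenvectors
G = Q.T @ A @ Q                          # Gram matrix in the A-weighted inner product
off_diag = G - np.diag(np.diag(G))
print(np.allclose(off_diag, 0))          # True: <q_i, A q_j> = 0 for i != j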


Returning to the problem of least squares, we can express the solution \(x\) as a linear combination of conjugate vectors:

\[ x = x_0 + \sum_{i=1}^n \alpha_i p_i \]

where:

  • \(x_0\) is an initial guess (can be zero).
  • \(\alpha_i\) are scalar coefficients.
  • \(p_i\) are conjugate vectors with respect to \(A\).

To recover the coefficients \(\alpha_i\), we can use a projection in the inner product weighted by \(A\):

\[ \begin{align} A x_0 + \sum_{j=1}^n \alpha_j A p_j &= b\\ r &= b - A x_0\\ \sum_{j=1}^n \alpha_j A p_j &= r\\ \Big\langle p_i, \sum_{j=1}^n \alpha_j A p_j \Big\rangle &= \langle p_i, r \rangle\\ \alpha_i \langle p_i, A p_i \rangle &= \langle p_i, r \rangle\\ \alpha_i &= \frac{\langle p_i, r \rangle}{\langle p_i, A p_i \rangle} \end{align} \] where the conjugacy of the \(p_j\) eliminates every term with \(j \neq i\). In the case where \(x_0\) is zero, this reduces to \[ \alpha_i = \frac{\langle p_i, b \rangle}{\langle p_i, A p_i \rangle} \]
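
To make this concrete, here is a small sketch of my own (not from the lecture) that builds a set of \(A\)-conjugate directions by Gram-Schmidt in the \(A\)-weighted inner product and recovers the coefficients with the projection formula above:

import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite test matrix
b = rng.standard_normal(n)

# Build A-conjugate directions p_1, ..., p_n by Gram-Schmidt in <u, v>_A = u^T A v
P = []
for _ in range(n):
    p = rng.standard_normal(n)
    for q in P:
        p -= (q @ A @ p) / (q @ A @ q) * q   # remove the A-component along each earlier direction
    P.append(p)

# Recover the coefficients alpha_i = <p_i, b> / <p_i, A p_i> (taking x_0 = 0)
x = np.zeros(n)
for p in P:
    x += (p @ b) / (p @ A @ p) * p

print(np.allclose(A @ x, b))         # True: the conjugate expansion solves A x = b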

Algorithm Steps

Initialize:

  • \(x = x_0\)
  • \(r_0 = b - A x_0\)
  • \(p_0 = r_0\)

For \(i = 0,1, 2, \ldots\):

  1. Compute \(\alpha_i\):

    \[ \alpha_i = \frac{\langle r_i, r_i \rangle}{\langle p_i, A p_i \rangle} \]

  2. Update Solution \(x\):

    \[ x_{i+1} = x_{i} + \alpha_i p_i \]

  3. Update Residual \(r\):

    \[ r_{i+1} = r_{i} - \alpha_i A p_i \]

  4. Check for Convergence:

    • If \(\| r_{i+1} \|\) is small enough, stop.
  5. Compute \(\beta_i\):

    \[ \beta_i = \frac{\langle r_{i+1}, r_{i+1}\rangle}{\langle r_i,r_i \rangle} \]

  6. Update Conjugate Direction \(p_{i+1}\):

    \[ p_{i+1} = r_{i+1} + \beta_i p_i \]
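
The steps above translate almost line for line into code. Below is a minimal sketch of my own (the function name conjugate_gradient and the test problem are illustrative, not from the lecture), assuming a symmetric positive definite \(A\):

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    # Minimal CG iteration following the steps above (A assumed symmetric positive definite)
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x                    # r_0 = b - A x_0
    p = r.copy()                     # p_0 = r_0
    rs_old = r @ r
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)    # alpha_i = <r_i, r_i> / <p_i, A p_i>
        x = x + alpha * p            # x_{i+1} = x_i + alpha_i p_i
        r = r - alpha * Ap           # r_{i+1} = r_i - alpha_i A p_i
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:    # stop when the residual norm is small enough
            break
        beta = rs_new / rs_old       # beta_i = <r_{i+1}, r_{i+1}> / <r_i, r_i>
        p = r + beta * p             # p_{i+1} = r_{i+1} + beta_i p_i
        rs_old = rs_new
    return x

# Quick check on a small symmetric positive definite system
rng = np.random.default_rng(0)
M = rng.standard_normal((10, 10))
A_spd = M @ M.T + 10 * np.eye(10)
b_test = rng.standard_normal(10)
x_cg = conjugate_gradient(A_spd, b_test)
print(np.allclose(A_spd @ x_cg, b_test))   # True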


The choice of step length can be seen more clearly if we trace through the minimization over \(\alpha\) for a fixed point \(x\) and search direction \(p\), writing \(r = b - A x\):

\[ \begin{align} & \min_\alpha \frac{1}{2} \|A (x+\alpha p) - b\|^2 \\ &= \frac{1}{2}r^T r - \alpha \langle r, A p \rangle + \frac{1}{2} \alpha^2 \langle p, A^T A p \rangle \\ 0 &= -\langle r, A p \rangle + \alpha \langle p, A^T A p \rangle \\ \alpha &= \frac{\langle r, A p \rangle}{\|A p\|^2} \end{align} \]

But we can also trace this through using the quadratic form \(\frac{1}{2} \tilde x^T A \tilde x - \tilde x^T b\) (the least squares expansion with the constant \(\frac{1}{2}\|b\|^2\) term removed and a symmetric positive definite \(A\)), evaluated at \(\tilde x = x + \alpha p\):

\[ \begin{align} & \min_\alpha \frac{1}{2} \tilde x^T A \tilde x - \tilde x^T b \\ &= \frac{1}{2} \left( x^T A x + 2 \alpha x^T A p + \alpha^2 p^T A p \right) - x^T b - \alpha p^T b\\ 0 &= x^T A p + \alpha p^T A p - p^T b \\ \alpha &= \frac{p^T (b - A x)}{p^T A p} = \frac{\langle p, r \rangle}{\langle p, A p \rangle} \end{align} \] which recovers the step length used in the algorithm, since the construction of the search directions gives \(\langle p_i, r_i \rangle = \langle r_i, r_i \rangle\).
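
Tying this back to the opening goal, the regularized least squares problem can be solved without computing the SVD of \(A\) by applying conjugate gradient to the normal equations \((A^T A + \lambda I)x = A^T b\), which are symmetric positive definite for \(\lambda > 0\). Here is a sketch of my own, reusing the conjugate_gradient routine defined above:

import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 24
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam = 1e-2

# Regularized normal equations: H x = A^T b with H = A^T A + lambda I (SPD for lambda > 0)
H = A.T @ A + lam * np.eye(n)
rhs = A.T @ b

x_cg = conjugate_gradient(H, rhs)                # CG sketch from above
x_direct = np.linalg.solve(H, rhs)               # direct solve for comparison
print(np.allclose(x_cg, x_direct))               # True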