Gradient Descent with Score Function
The classic inverse problem is defined as

$$\mathbf{d} = F(\mathbf{m}) + \boldsymbol{\varepsilon},$$

where $\mathbf{d}$ is the observed data, $F$ is the forward operator, $\mathbf{m}$ are the model parameters, and $\boldsymbol{\varepsilon}$ is the observational error. Assuming the error is Gaussian with covariance $\Sigma_{\varepsilon}$, the probability of deviation of the observed data from the forward model in this case is given by:

$$p(\mathbf{d} \mid \mathbf{m}) \propto \exp\left(-\tfrac{1}{2}\big(\mathbf{d} - F(\mathbf{m})\big)^{\top} \Sigma_{\varepsilon}^{-1} \big(\mathbf{d} - F(\mathbf{m})\big)\right)$$

Without any prior information about the error, it is difficult to estimate the covariance matrix $\Sigma_{\varepsilon}$.

To model the probability distribution of the inverse problem parameters $\mathbf{m}$, we express it in terms of a potential (energy) function $U(\mathbf{m})$:

$$p(\mathbf{m}) = \frac{1}{Z} e^{-U(\mathbf{m})},$$

where the partition function is

$$Z = \int e^{-U(\mathbf{m})}\, d\mathbf{m}.$$

Note the partition function is required to make the probability distribution integrate to one.
Maximum A Posteriori Estimation
The goal of maximum a posteriori (MAP) estimation is to find the most likely model parameters given the observed data:

$$\mathbf{m}_{\text{MAP}} = \arg\max_{\mathbf{m}}\, p(\mathbf{m} \mid \mathbf{d}) = \arg\max_{\mathbf{m}}\, p(\mathbf{d} \mid \mathbf{m})\, p(\mathbf{m})$$

Since the logarithm is a monotonic function, we can maximize the log of the posterior instead of the posterior itself with no loss of generality. Dropping the terms that do not depend on $\mathbf{m}$ (including the partition function $Z$), this becomes the minimization problem

$$\mathbf{m}_{\text{MAP}} = \arg\min_{\mathbf{m}}\, \left[ \tfrac{1}{2}\big(\mathbf{d} - F(\mathbf{m})\big)^{\top} \Sigma_{\varepsilon}^{-1} \big(\mathbf{d} - F(\mathbf{m})\big) + U(\mathbf{m}) \right].$$

We have previously looked at methods for differentiating the forward operator $F$ with respect to the model parameters. Using gradient descent, we can update the model parameters with

$$\mathbf{m}_{k+1} = \mathbf{m}_k - \alpha\, \nabla_{\mathbf{m}} \left[ \tfrac{1}{2}\big(\mathbf{d} - F(\mathbf{m}_k)\big)^{\top} \Sigma_{\varepsilon}^{-1} \big(\mathbf{d} - F(\mathbf{m}_k)\big) + U(\mathbf{m}_k) \right],$$

where $\alpha$ is the step size.
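As a minimal sketch of this update (not tied to any particular application), assume a linear forward operator $F(\mathbf{m}) = G\mathbf{m}$, isotropic noise covariance $\Sigma_{\varepsilon} = \sigma^2 I$, and a Gaussian prior potential $U(\mathbf{m}) = \lVert\mathbf{m}\rVert^2/(2\lambda)$; the operator, data, and parameter values below are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear forward operator and synthetic data: d = G m + noise
G = rng.normal(size=(20, 5))
m_true = rng.normal(size=5)
sigma, lam = 0.1, 1.0                  # noise std and prior scale (illustrative values)
d = G @ m_true + sigma * rng.normal(size=20)

def grad(m):
    # Gradient of 0.5*||d - G m||^2 / sigma^2 + ||m||^2 / (2*lam)
    return -G.T @ (d - G @ m) / sigma**2 + m / lam

m = np.zeros(5)
alpha = 1e-4                           # step size
for _ in range(5000):
    m = m - alpha * grad(m)

print(np.round(m, 3), np.round(m_true, 3))  # the MAP estimate approaches m_true
```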
Score Function
The score function is the gradient of the log-probability, which is the negative gradient of the potential function:

$$s(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}) = -\nabla_{\mathbf{x}} U(\mathbf{x}).$$

Note that the partition function disappears, since $\nabla_{\mathbf{x}} \log Z = 0$.

The score has a physical intuition connected to energy potentials and fields. In physics, the electric field is the negative gradient of the electric potential, $\mathbf{E} = -\nabla V$, and it points from regions of high potential toward regions of low potential. Similarly, the score function is the negative gradient of the potential function $U(\mathbf{x})$: it points away from regions of high potential (low probability density) and toward regions of low potential (high probability density).
Example: 2D Gaussian Distribution
Consider a 2D Gaussian distribution with zero mean and covariance $\sigma^2 I$.

Function | Expression
---|---
Probability Distribution | $p(\mathbf{x}) = \dfrac{1}{2\pi\sigma^2}\exp\left(-\dfrac{\lVert\mathbf{x}\rVert^2}{2\sigma^2}\right)$
Potential Function | $U(\mathbf{x}) = \dfrac{\lVert\mathbf{x}\rVert^2}{2\sigma^2} + \log\left(2\pi\sigma^2\right)$
Score Function | $s(\mathbf{x}) = -\dfrac{\mathbf{x}}{\sigma^2}$

In regions of high probability density the potential function is low, because of the relation $U(\mathbf{x}) = -\log p(\mathbf{x})$, and the score points toward those high-density regions.
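As a quick sanity check of the table above (purely illustrative), the analytic score $-\mathbf{x}/\sigma^2$ can be compared against a finite-difference gradient of $\log p$ at an arbitrary point:

```python
import numpy as np

sigma = 1.0
log_p = lambda x, y: -0.5 * (x**2 + y**2) / sigma**2 - np.log(2 * np.pi * sigma**2)

x0, y0, h = 0.7, -1.2, 1e-5
# Central finite-difference approximation of the gradient of log p
fd_score = np.array([
    (log_p(x0 + h, y0) - log_p(x0 - h, y0)) / (2 * h),
    (log_p(x0, y0 + h) - log_p(x0, y0 - h)) / (2 * h),
])
analytic_score = -np.array([x0, y0]) / sigma**2

print(fd_score, analytic_score)  # both should be approximately [-0.7, 1.2]
```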
Visualization
Below is a visualization of the probability density function (PDF), potential function, and score function of a 2D Gaussian distribution.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Make plots for a 2D Gaussian distribution heatmap, potential function, and score function
# Define the 2D Gaussian distribution
def gaussian_pdf(x, y, sigma=1):
    return np.exp(-0.5*(x**2 + y**2)/sigma**2)/(2*np.pi*sigma**2)

# Define the potential function (negative log of the PDF)
def potential_function(x, y, sigma=1):
    return 0.5*(x**2 + y**2)/sigma**2 + np.log(2*np.pi*sigma**2)

# Define the score function
def score_function(x, y, sigma=1):
    return -np.array([x, y])/sigma**2
# Create a grid of points
x = np.linspace(-3, 3, 500)
y = np.linspace(-3, 3, 500)
X, Y = np.meshgrid(x, y)

# Compute the PDF, potential function, and score function
pdf = gaussian_pdf(X, Y)
potential = potential_function(X, Y)
score = score_function(X, Y)
# Plot the Probability Distribution with a colorbar
plt.figure(figsize=(4, 3))
im = plt.imshow(pdf, cmap='viridis', extent=[-3, 3, -3, 3])
plt.axis('off')
plt.colorbar(im, shrink=0.8, label="Density")
plt.show()

# Plot the Potential Function with a colorbar
plt.figure(figsize=(4, 3))
im = plt.imshow(potential, cmap='viridis', extent=[-3, 3, -3, 3])
plt.axis('off')
plt.colorbar(im, shrink=0.8, label="Potential")
plt.show()

# Downsample the grid for quiver plotting
step = 50  # Downsample by taking every 50th point
X_downsampled = X[::step, ::step]
Y_downsampled = Y[::step, ::step]
score_downsampled = score_function(X_downsampled, Y_downsampled)

# Plot the Score Function as a quiver plot over the PDF
plt.figure(figsize=(4, 3))
plt.imshow(pdf, cmap='viridis', extent=[-3, 3, -3, 3])
plt.quiver(
    X_downsampled, Y_downsampled,
    score_downsampled[0], score_downsampled[1],
    color='black'
)
plt.axis('off')

# Save to file
plt.savefig("imgs/score_function.png", bbox_inches='tight')
plt.show()
Maximum Likelihood Estimate from Samples
A common problem is estimating a probability distribution $p(\mathbf{x})$ from a set of samples $\{\mathbf{x}_i\}_{i=1}^{N}$ drawn from it. Since the true distribution is unknown, we model it with a parametric family of the same form as before,

$$p_\theta(\mathbf{x}) = \frac{1}{Z_\theta}\, e^{-U_\theta(\mathbf{x})},$$

and fit the parameters $\theta$ to the samples. Maximum likelihood estimation (MLE) is the process of finding the parameters that make the observed samples most probable, i.e., minimizing the negative log-likelihood

$$\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(\mathbf{x}_i) = \log Z_\theta + \frac{1}{N}\sum_{i=1}^{N} U_\theta(\mathbf{x}_i).$$

However, we again run into the problem of the partition function $Z_\theta = \int e^{-U_\theta(\mathbf{x})}\, d\mathbf{x}$, which is generally intractable.
Minimizing with Gradient Descent
The gradient of the MLE objective with respect to the parameters $\theta$ is

$$\nabla_\theta \mathcal{L}(\theta) = \nabla_\theta \log Z_\theta + \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta U_\theta(\mathbf{x}_i).$$

The left term can be further reduced by using the definition of the partition function:

$$\nabla_\theta \log Z_\theta = \frac{1}{Z_\theta}\int \nabla_\theta\, e^{-U_\theta(\mathbf{x})}\, d\mathbf{x} = -\int \frac{e^{-U_\theta(\mathbf{x})}}{Z_\theta}\, \nabla_\theta U_\theta(\mathbf{x})\, d\mathbf{x} = -\,\mathbb{E}_{\mathbf{x} \sim p_\theta}\big[\nabla_\theta U_\theta(\mathbf{x})\big].$$

We estimate the value of this expectation with Monte Carlo samples $\tilde{\mathbf{x}}_j$ drawn from the model $p_\theta$, giving

$$\nabla_\theta \mathcal{L}(\theta) \approx \underbrace{\frac{1}{N}\sum_{i=1}^{N} \nabla_\theta U_\theta(\mathbf{x}_i)}_{\text{analysis}} - \underbrace{\frac{1}{M}\sum_{j=1}^{M} \nabla_\theta U_\theta(\tilde{\mathbf{x}}_j)}_{\text{synthesis}}.$$

An optimal point is reached by the first-order optimality condition, where the gradient is zero. The MLE objective is therefore minimized when the synthesis and analysis terms are equal. This is known as score matching and is a method for estimating the parameters of a probability distribution from samples. The synthesis term is the expectation of $\nabla_\theta U_\theta$ over samples generated (synthesized) by the model, while the analysis term is the same expectation over the observed data. As before, gradient descent can be used to update the model parameters:

$$\theta_{k+1} = \theta_k - \alpha\, \nabla_\theta \mathcal{L}(\theta_k).$$
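The sketch below illustrates this analysis/synthesis gradient on a deliberately simple, hypothetical case: fitting the variance $\theta$ of a zero-mean 1D Gaussian with potential $U_\theta(x) = x^2/(2\theta)$, where drawing the synthesis samples from the model is trivial (in general that sampling step is the hard part):

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed samples from an "unknown" Gaussian (true variance = 2.0)
data = rng.normal(0.0, np.sqrt(2.0), size=5000)

def grad_U(x, theta):
    # d/dtheta of U_theta(x) = x^2 / (2*theta)
    return -x**2 / (2 * theta**2)

theta, lr = 0.5, 0.5   # initial variance guess and step size (illustrative)
for _ in range(200):
    # Analysis term: expectation of grad_U under the observed data
    analysis = grad_U(data, theta).mean()
    # Synthesis term: expectation of grad_U under samples drawn from the current model
    model_samples = rng.normal(0.0, np.sqrt(theta), size=5000)
    synthesis = grad_U(model_samples, theta).mean()
    # Gradient of the negative log-likelihood: analysis minus synthesis
    theta -= lr * (analysis - synthesis)

print(theta, data.var())  # theta converges towards the sample variance
```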
An Application of Score: Langevin Dynamics
An example of the usage of a learned score function is Langevin dynamics, a method for sampling from a probability distribution $p(\mathbf{x})$ using only its score. Starting from an arbitrary initial point, samples are updated with

$$\mathbf{x}_{t+1} = \mathbf{x}_t + \tau\, s(\mathbf{x}_t) + \sqrt{2\tau}\, \mathbf{z}_t,$$

where $\tau$ is the step size and $\mathbf{z}_t \sim \mathcal{N}(0, I)$ is Gaussian noise. The score function pushes samples towards regions of high probability density, since it is a vector that points in the direction of maximum increase of the probability density. The noise term $\mathbf{z}_t$ keeps the samples from collapsing onto a single mode; with a small enough step size and enough iterations, the samples become distributed according to $p(\mathbf{x})$.

The score points in the same direction as the gradient of the probability density, so at each time step the score term moves the sample toward a higher-probability region, while the stochastic term adds the randomness needed to explore the distribution.
Code Example
Below is an example of Langevin dynamics applied to a 2D Gaussian distribution. The score function is used to push the samples towards the center of the distribution, while the stochastic element adds noise and randomization to the process. The animation shows an initial uniform grid of samples converging to the shape of the Gaussian distribution.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import seaborn as sns
# Ensure the Pillow writer is available
from matplotlib.animation import PillowWriter
# Define the 2D Gaussian distribution
def gaussian_pdf(x, y, sigma=1):
    return np.exp(-0.5*(x**2 + y**2)/sigma**2)/(2*np.pi*sigma**2)

# Define the potential function (negative log of the PDF)
def potential_function(x, y, sigma=1):
    return 0.5*(x**2 + y**2)/sigma**2 + np.log(2*np.pi*sigma**2)

# Define the score function
def score_function(x, y, sigma=1):
    return -np.array([x, y])/sigma**2
# Define the Langevin Dynamics update
def langevin_dynamics(samples, sigma=1, dt=0.05):
    z = np.random.normal(0, 1, samples.shape)
    score = score_function(samples[:, 0], samples[:, 1], sigma)
    samples = samples + dt * score.T + np.sqrt(2 * dt) * z
    return samples
# Create a grid of points for the contour plot
x = np.linspace(-3, 3, 500)
y = np.linspace(-3, 3, 500)
X, Y = np.meshgrid(x, y)

# Compute the PDF for contour plotting
sigma = .5
pdf = gaussian_pdf(X, Y, sigma)
# Initialize samples on a 10x10 grid
grid_size = 10
grid_range = np.linspace(-2.5, 2.5, grid_size)
initial_samples = np.array([[x, y] for x in grid_range for y in grid_range])
samples = initial_samples.copy()
# Set up the figure and axis
fig, ax = plt.subplots(figsize=(8, 6))

# Plot the initial contour
contour = ax.contourf(X, Y, pdf, levels=50, cmap='viridis')
scatter = ax.scatter(samples[:, 0], samples[:, 1], color='red', s=30, label='Samples')
# Add a legend
ax.legend()
# Function to update the animation at each frame
def update(frame):
    global samples
    samples = langevin_dynamics(samples, dt=0.002, sigma=sigma)
    scatter.set_offsets(samples)
    ax.set_title(f'Langevin Dynamics Iteration: {frame+1}')
    return scatter,
# Create the animation
ani = animation.FuncAnimation(
    fig, update, frames=400, interval=200, blit=True, repeat=False
)

# Save the animation as a GIF using the Pillow writer imported above
ani.save("imgs/langevin_dynamics.gif", writer="pillow", fps=10)
# Display the animation inline (if supported)
plt.close(fig)
MAP Estimation with General Gaussian
Revisiting the MAP estimation problem, we can consider a more general Gaussian prior over the model parameters,

$$p(\mathbf{m}) = \mathcal{N}(\boldsymbol{\mu}, \Sigma) \propto \exp\left(-\tfrac{1}{2}(\mathbf{m} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{m} - \boldsymbol{\mu})\right).$$

We now have to make a choice of which mean $\boldsymbol{\mu}$ and covariance $\Sigma$ to use. Given samples $\{\mathbf{m}_i\}_{i=1}^{N}$ from the prior (and taking the mean as known), the covariance can be estimated by minimizing the negative log-likelihood

$$\mathcal{L}(\Sigma) = \frac{N}{2}\log\det\Sigma + \frac{1}{2}\sum_{i=1}^{N} (\mathbf{m}_i - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{m}_i - \boldsymbol{\mu}) + \text{const}.$$
Finding the Gradient
Left Term: To find the minimum using gradient descent we take the gradient of each term, which is most convenient to do with respect to the precision matrix $\Lambda = \Sigma^{-1}$. The left term can be rewritten as:

$$\frac{N}{2}\log\det\Sigma = -\frac{N}{2}\log\det\Lambda.$$

Applying the identity $\nabla_{X}\log\det X = X^{-\top}$ to the left term we get:

$$\nabla_{\Lambda}\left[-\frac{N}{2}\log\det\Lambda\right] = -\frac{N}{2}\,\Lambda^{-1} = -\frac{N}{2}\,\Sigma.$$

Right Term: The gradient of the second term is found by reformulating it as a trace of a new matrix:

$$\frac{1}{2}\sum_{i=1}^{N} (\mathbf{m}_i - \boldsymbol{\mu})^{\top} \Lambda\, (\mathbf{m}_i - \boldsymbol{\mu}) = \frac{1}{2}\operatorname{tr}(\Lambda S), \qquad S = \sum_{i=1}^{N} (\mathbf{m}_i - \boldsymbol{\mu})(\mathbf{m}_i - \boldsymbol{\mu})^{\top}.$$

The Matrix Cookbook (Petersen and Pedersen 2012) formula (100) gives the gradient of the trace of a matrix product, $\nabla_{X}\operatorname{tr}(XA) = A^{\top}$. Applying this we get:

$$\nabla_{\Lambda}\left[\frac{1}{2}\operatorname{tr}(\Lambda S)\right] = \frac{1}{2}\,S^{\top} = \frac{1}{2}\,S.$$

Combined Result:

Combining the two terms we get the gradient of the MLE objective with respect to the precision matrix:

$$\nabla_{\Lambda}\mathcal{L} = -\frac{N}{2}\,\Sigma + \frac{1}{2}\,S.$$

Solving for where the gradient is zero, we find the optimal value of the covariance:

$$\Sigma = \frac{1}{N}\,S = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{m}_i - \boldsymbol{\mu})(\mathbf{m}_i - \boldsymbol{\mu})^{\top}.$$

This is the maximum likelihood estimate of the covariance matrix: the empirical covariance of the samples.
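As a quick numerical check of this closed-form result (purely illustrative, with a made-up covariance), the formula can be compared against the covariance used to generate synthetic samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from a Gaussian with a known covariance
mu = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
M = rng.multivariate_normal(mu, Sigma_true, size=10000)   # shape (N, 2)

# MLE of the covariance: (1/N) * sum_i (m_i - mu)(m_i - mu)^T, with mu known
D = M - mu
Sigma_mle = D.T @ D / len(M)

print(np.round(Sigma_mle, 2))   # close to Sigma_true
```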
Unfortunately, estimating parameters can be very difficult to do for non-Gaussian distributions, where closed-form solutions like this rarely exist and the partition function must be handled explicitly.
Conclusion
MAP estimation on an inverse problem may require the use of a prior distribution over the model parameters, which can be expressed through a potential function and its score.
If the prior is not known, an MLE estimate can be made to find the parameters of an assumed distribution from samples, as was done here for the covariance of a Gaussian prior.