Lecture 10

MAP, MLE, and Score Function

In this mathematical lecture, some of the foundational principles of Bayesian inverse problems and their statistical interpretation are discussed. A set of computational tools are shown that aid in finding the solution to some of these problems.
Author

Simon Ghyselincks

Published

November 22, 2024

Gradient Descent with Score Function

Definitions
Component Description Dimensions
F(x) Forward operator RnRm
x Model parameters Rn
b Observed data Rm
ϵ Noise Rm
π(x) Probability distribution
ϕ(x,θ) Potential function RnR
θ Learnable Parameters Rp
Z(θ) Partition Function R
s(x,θ) Score Function Rn

The classic inverse problem is defined as

b=F(x)+ϵ

where F(x) is the forward operator, b is the observed data, and ϵ represents the noise in the measurement or process. We often assume that ϵ is Gaussian with zero mean and covariance matrix Σ.

ϵN(0,Σ)

The probability of deviation of the observed data from the forward model in this case is given by: π(ϵ)=1(2π)m/2|Σ|1/2exp(12ϵΣ1ϵ)

Without any prior information about the error, it is difficult to estimate the covariance matrix Σ. For the purpose of this analysis we can assume that it is a normal distriution with zero mean and a diagonal σ2I covariance matrix. The likelihood function simplifies to:

π(ϵ)=1(2πσ2)m/2exp(12σ2bF(x)2)

To model the probability distribution of the inverse problem parameters x, we introduce a prior distribution π(x). To ensure positivity of π(x) over the entire domain and proper normalization, we define it using a potential function ϕ(x,θ): π(x;θ)=eϕ(x,θ)Z(θ)

Where the partion function Z(θ) is given by:

Z(θ)=Ωeϕ(x,θ)dx

Note the partition function is required to make the probability distribution integrate to 1. The exponential operator on the potential ensures that all π(x) values are positve since eϕ>0 for all zR. In practice, it is often intractable to directly compute the partition function when updating the model parameters θ for distributions that are more complex than a Gaussian.

ϕ(x,θ):RnR maps x to a scalar value, and θ are the parameters of the model. For example if we are modeling a Gaussian, the parameters might include the covariance matrix Σ. It has a physical interpretation as an energy of a system, where ϕ values correspond to low probability density regions. For this reason it is often called the energy function in physics-inspired models.

Maximum A Posteriori Estimation

The goal of maximum a posteriori (MAP) estimation is to find the most likely x given the observed data b and model parameters θ, maximize the posterior probability π(x|b;θ):

maxxπ(x|b;θ)=maxxπ(b|x;θ)π(x;θ)π(b)maxxπ(x|b;θ)Posterior=maxxπ(b|x;θ)Likelihoodπ(x;θ)Prior

Since π(b) is independent of x, it does not affect the maximization problem. Substituting the likelihood and prior distributions, we have:

=maxx1(2πσ2)m/2exp(12σ2bF(x)2)1Z(θ)eϕ(x,θ)

The logarithm is a monotonic function, we can maximize the log-likelihood instead of the likelihood with no loss of generality. maxxπ(x)=maxxlog(π(x)). Intuitively, since the logarithim is always increasing in output, log(z)>log(y) implies z>y. In addition the product of two exponentials is the same as the sum of the exponents, and the maximum of a function is the same as the minimum of the negative of the function. This allows us to rewrite the log-likelihood as:

Minimization Objective

maxxlogπ(x|b;θ)=minx(12σ2bF(x)2+ϕ(x,θ))

We have looked at methods previously of how to differentiate the forward operator F and perform gradient descent. We take the gradient with respect to x to find the minimum of the function.

g=x(12σ2bF(x)2ϕ(x,θ))=1σ2Fx(F(x)b)xϕ(x,θ)=1σ2JT(x)(F(x)b)+s(x,θ)

Using gradient descent, we can update the model parameters θ by taking steps in away from the direction of the gradient g:

xk+1=xkαg

Score Function

s(x;θ) is known as the score function of π(x;theta). s(x,θ):=xlog(π(x))=xϕ(x,θ)+C

It is the negative gradient of the potential function ϕ(x,θ) with respect to x. The score function is a generalization of the gradient of the log-likelihood function, and is described in more detail in Schervish’s “Theory of Statistics” ().

Score has a physical intution connected to energy potentials and fields. In physics, the electric field E is the negative gradient of the electric potential V: E=xV(x)

Simalarly, the score function is the negative gradient of the potential function ϕ(x,θ) in the case where π(x)=eϕ(x,θ). The score function is the direction in which the probability distribution is most likely to change.

Example: 2D Gaussian Distribution

Consider a 2D Gaussian distribution with zero mean and covariance σ2I:

Function Expression
Probability Distribution π(x)=12πσ2exp(12σ2x2)
Potential Function ϕ(x,θ)=12σ2x2log(2πσ2)
Score Function xϕ(x,θ)=(xσ2x)=xσ2

In regions of high probability density, the potential function is low becuase the relation π(x)=eϕ(x,θ) is monotonic in ϕ(x,θ). The score funtion is always pointing in the local direction of the largest directional derivative of the probability distribution π.

Visualization

Below is a visualization of the probability density function (PDF), potential function, and score function of a 2D Gaussian distribution.

Show the code
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Make plots for a 2D Gaussian distribution heatmap, potential function, and score function

# Define the 2D Gaussian distribution
def gaussian_pdf(x, y, sigma=1):
    return np.exp(-0.5*(x**2 + y**2)/sigma**2)/(2*np.pi*sigma**2)

# Define the potential function
def potential_function(x, y, sigma=1):
    return 0.5*(x**2 + y**2)/sigma**2 - np.log(2*np.pi*sigma**2)

# Define the score function
def score_function(x, y, sigma=1):
    return -np.array([x, y])/sigma**2

# Create a grid of points
x = np.linspace(-3, 3, 500)
y = np.linspace(-3, 3, 500)
X, Y = np.meshgrid(x, y)

# Compute the PDF, potential function, and score function
pdf = gaussian_pdf(X, Y)
potential = potential_function(X, Y)
score = score_function(X, Y)

# Plot the Probability Distribution with a colorbar
plt.figure(figsize=(4, 3))
im = plt.imshow(pdf, cmap='viridis', extent=[-3, 3, -3, 3])
plt.axis('off')
plt.colorbar(im, shrink=0.8, label="Density")
plt.show()

# Plot the Potential Function with a colorbar
plt.figure(figsize=(4, 3))
im = plt.imshow(potential, cmap='viridis', extent=[-3, 3, -3, 3])
plt.axis('off')
plt.colorbar(im, shrink=0.8, label="Potential")
plt.show()

# Downsample the grid for quiver plotting
step = 50  # Downsample by taking every 50th point
X_downsampled = X[::step, ::step]
Y_downsampled = Y[::step, ::step]
score_downsampled = score_function(X_downsampled, Y_downsampled)

# Plot the Score Function as a quiver plot over the PDF
plt.figure(figsize=(4, 3))
plt.imshow(pdf, cmap='viridis', extent=[-3, 3, -3, 3])
plt.quiver(
    X_downsampled, Y_downsampled, 
    score_downsampled[0], score_downsampled[1], 
    color='black'
)
plt.axis('off')

# Save to file
plt.savefig("imgs/score_function.png", bbox_inches='tight')

plt.show()
(a) Probability Distribution
(b) Potential Function
(c) Score Function
Figure 1: PDF, Potential Function, and Score Function of a 2D Gaussian Normal Distribution

Maximum Likelihood Estimate from Samples

A common problem is estimating a probability distribution π(x) based on a set of empirical samples {x1,x2,...,xN}. Often in statistical analysis, we are working with an unknown π and attempting to make our best estimate from sampled data. If we assume that the drawn samples are independent and identically distributed (i.i.d.), then the likelihood of the samples is the product of the likelihood of each sample. Recalling that Z(θ)=Ωπ(x)dx, the likelihood of the samples is:

π(x1,x2,...,xN)=i=1Nπ(xi)=π(x1)π(x2)...π(xN)=1Z(θ)exp(ϕ(x1,θ))1Z(θ)exp(ϕ(x2,θ))...1Z(θ)exp(ϕ(xN,θ))=1[Z(θ)]Nexp(i=1Nϕ(xi,θ))

Since some x are observed, the challenge is to find the potential function ϕ(x,θ) that maximizes the likelihood of the samples, given the model parameters θ and a fixed family of ϕ(,;θ) functions.

The Maximum Likelihood Estimation (MLE) is the process of finding the parameters θ that maximize the likelihood of having observed the samples. Unlike the MAP estimate, there is no posterior and prior distribution. So in this case π(x|θ) is being directly maximized. This is equivalent to minimizing the negative log-likelihood as before:

MLE=argmaxθ1[Z(θ)]Nexp(i=1Nϕ(xi,θ))=argminθNlog(Z(θ))+i=1Nϕ(xi,θ)=argminθlog(Z(θ))+1Ni=1Nϕ(xi,θ)

However we again run into the problem of the partition function Z(θ) which for most distributions in higher dimensions is intractable to compute. For example we may be trying to solve an integral in 100 dimensions with no analytical solution.

x1x2...x100eϕ(x,θ)dx1dx2...dx100

Minimizing with Gradient Descent

Minimization Objective

MLE=argminθ(log(Z(θ))+1Ni=1Nϕ(xi,θ))

The gradient of the MLE objective with respect to θ is:

g=θ(log(Z(θ))+1Ni=1Nϕ(xi,θ))=θlog(Z(θ))+1Ni=1Nθϕ(xi,θ)

The left side term can be further reduced by using the definition of the partion function Z(θ)=Ωeϕ(x,θ)dx, the probability distribution πθ(x)=eϕ(x,θ):

θlog(Z(θ))=1Z(θ)θΩeϕ(x,θ)dx=Ω1Z(θ)θeϕ(x,θ)dx=Ω1Z(θ)eϕ(x,θ)θϕ(x,θ)dx=Ωπ(x)θϕ(x,θ)dx=Exπθ(x)[θϕ(x,θ)]1Mi=1Mθϕ(xi,θ)

We estimate the value of θlog(Z(θ)) by taking the expectation value of the score function over the samples. This is a Monte Carlo approximation of the true integral using the available i.i.d. samples {x1,x2,...,xN}. The gradient of the MLE objective is then:

Score Matching Gradient

g=1Mi=1Mθϕ(x~i,θ)Synthesis+1Ni=1Nθϕ(xi,θ)Analysis

An optimal point is reached by first order optimality conditions, where the gradient is zero. The MLE objective is minimized when the synthesis and analysis terms are equal. This is known as score matching and is a method for estimating the parameters of a probability distribution from samples. The synthesis x~i terms are drawn randomly from the proposed score ϕ(x~i,θ) , while the analysis xi terms are taken over the samples {x1,x2,...,xN}.

As before gradient descent can be used to update the model parameters θ:

θk+1=θkαg

An Application of Score: Langevin Dynamics

An example of the usage of a learned score function is in Langevin Dynamics. Langevin Dynamics is a method for sampling from a probability distribution π(x) by simulating a stochastic differential equation (SDE) that converges to the distribution. The SDE is given by:

dx=xϕ(x,θ)dt+2dW

The score function pushes samples towards regions of high probability density, since it is a vector that points in the direction of maximum increase of the probability distribution. The noise term dW is a Wiener process, which is a continuous-time stochastic process that is normally distributed with mean zero and variance dt. The Langevin Dynamics algorithm is a discretized version of the SDE:

xk+1=xkΔtxϕ(xk,θ)Score Term+2ΔtzkStochastic Term

Where zkN(0,1) is a random sample from a normal distribution with zero mean and unit variance.

The score points in the same direciton as the gradient of the probability distribution, so at each time step, the score term moves the sample to a higher probability region. The stochastic term adds noise to the sample, providing randomness to the process.

Code Example

Below is an example of Langevin Dynamics applied to a 2D Gaussian distribution. The score function is used to push the samples towards the center of the distribution, while the stochastic element adds noise and randomization to the process. The animation show an intial uniform sampling grid of points converging to the shape of the Gaussian distribution.

Show the code
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import seaborn as sns

# Ensure the Pillow writer is available
from matplotlib.animation import PillowWriter

# Define the 2D Gaussian distribution
def gaussian_pdf(x, y, sigma=1):
    return np.exp(-0.5*(x**2 + y**2)/sigma**2)/(2*np.pi*sigma**2)

# Define the potential function
def potential_function(x, y, sigma=1):
    return 0.5*(x**2 + y**2)/sigma**2 - np.log(2*np.pi*sigma**2)

# Define the score function
def score_function(x, y, sigma=1):
    return -np.array([x, y])/sigma**2

# Define the Langevin Dynamics update
def langevin_dynamics(samples, sigma=1, dt=0.05):
    z = np.random.normal(0, 1, samples.shape)
    score = score_function(samples[:, 0], samples[:, 1], sigma)
    samples = samples + dt * score.T + np.sqrt(2 * dt) * z
    return samples

# Create a grid of points for the contour plot
x = np.linspace(-3, 3, 500)
y = np.linspace(-3, 3, 500)
X, Y = np.meshgrid(x, y)

# Compute the PDF for contour plotting
sigma = .5
pdf = gaussian_pdf(X, Y, sigma)

# Initialize samples on a 5x5 grid
grid_size = 10
grid_range = np.linspace(-2.5, 2.5, grid_size)
initial_samples = np.array([[x, y] for x in grid_range for y in grid_range])
samples = initial_samples.copy()

# Set up the figure and axis
fig, ax = plt.subplots(figsize=(8, 6))

# Plot the initial contour
contour = ax.contourf(X, Y, pdf, levels=50, cmap='viridis')
scatter = ax.scatter(samples[:, 0], samples[:, 1], color='red', s=30, label='Samples')

# Add a legend
ax.legend()

# Function to update the animation at each frame
def update(frame):
    global samples
    samples = langevin_dynamics(samples, dt=0.002, sigma=sigma)
    scatter.set_offsets(samples)
    ax.set_title(f'Langevin Dynamics Iteration: {frame+1}')
    return scatter,

# Create the animation
ani = animation.FuncAnimation(
    fig, update, frames=400, interval=200, blit=True, repeat=False
)

# Save the animation as a GIF
ani.save("imgs/langevin_dynamics.gif", writer="imagemagick", fps=10)

# Display the animation inline (if supported)
plt.close(fig)
Figure 2

Langevin Dynamics to a 2d Gaussian

Langevin Dynamics to a 2d Gaussian

MAP Estimation with General Gaussian

Revisitng the MAP estimation problem, we can consider a more general prior π(x) with mean μ=0 and covariance Σ:

π(b|x)=1(2πσ2)m/2exp(12σ2bF(x)2)πθ(x)=1(2π)n/2|Σ|1/2exp(12xΣ1x)MAP=minx(12σ2bF(x)2+12xΣ1x)

We now make a choice of which piθ(x) to use, as before this is a MLE based on the available empirical samples. For the purpose of finding the gradient, let the minimization parameters be θ=Σ1, since the true Σ is positive definite, it can be found by inverting this result. The partition function Z(θ) is known for the case of a Gaussian distribution, and so the MLE objective is:

π(x)=maxθΠi=1Nπθ(xi)=maxθ1Z(θ)Nexp(i=1N12xiΣ1xi)=minθNlog(Z(θ))+12i=1NxiΣ1xi=minθlog((2π)n/2|Σ|1/2)+12Ni=1NxiΣ1xi=minθ12log(|Σ|)+12Ni=1NxiΣ1xi

Finding the Gradient

Left Term: To find the min using gradient descent we take the gradient of each term with respect to Σ1. The gradient of a log determinant can be found in the matrix cookbook () formula (57): Alog(|A|)=(A1)

This can be rewritten as:

Alog(|A1|)=A

Applying this to the left term we get:

Σ1(12log(|Σ|))=12Σ

Right Term: The gradient of the second term is found by reformulating it as a trace of a new matrix X=xixi:

xjΣ1xj=trace(xjΣ1xj)=trace(Σ1xjxj)=trace(Σ1X)

The matrix cookbook () formula (100) gives the gradient of the trace of a matrix product:

Xtrace(XB)=B

Applying this we get:

Σ1(12NxiΣ1xi)=12NX=12Nxixi

Combined Result:

Combining the two terms we get the gradient of the MLE objective with respect to Σ1: g=Σ1(12log(|Σ|1)+12Ni=1NxiΣ1xi)=12Σ12Ni=1Nxixi

Solving for where the gradient is zero, we find the optimal value of Σ, since Σ=Σ:

Σ=1Ni=1Nxixi

This is the maximum likelihood estimate of the covariance matrix Σ given the samples {x1,x2,...,xN}, which can be interpreted as the expectation value of the outer product of the samples. The estimated covariance matrix is the actual true covariance matrix of the sampled data. This is a common result in statistics, where the sample mean is the MLE of the true mean of the distribution.

Unfortunately, estimating parameters can be very difficult to do for non-Gaussian πθ due to the partition function Z(θ) being intractable.

Conclusion

MAP estimation on an inverse problem may require the use of a prior distribution π(x) to regularize the solution. The potential function ϕ(x,θ) is used to define the probability distribution π(x;θ), and the score function s(x,θ) is the negative gradient of the potential function.

If the prior is not known, an MLE estimate can be made to find the parameters θ that maximize the likelihood of the samples based on an assumed parameterized model. The score matching gradient is used to estimate the gradient of the MLE objective with respect to θ.

References

Petersen, K. B., and M. S. Pedersen. 2012. “The Matrix Cookbook.” Technical University of Denmark. http://www2.compute.dtu.dk/pubdb/pubs/3274-full.html.
Schervish, Mark J. 1995. Theory of Statistics. Springer Series in Statistics. New York, NY: Springer.