Score Matching
Previously in Lecture 10, the concept of score as the gradient of the log-likelihood was introduced.
For a probability distribution $p(x)$, the score is defined as:
$$
s(x) = \nabla_x \log p(x)
$$
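As a minimal sketch of this definition, taking the score to be the gradient of the log-density with respect to $x$, the snippet below compares a finite-difference score with the analytical score of a one-dimensional Gaussian. The parameters `mu` and `sigma` and the helper name `log_density` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

mu, sigma = 1.0, 2.0  # illustrative parameters (assumed, not from the lecture)

def log_density(x):
    """Log-density of a 1D Gaussian N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

x = np.linspace(-5.0, 7.0, 2001)

# Numerical score: finite-difference gradient of the log-density
score_numeric = np.gradient(log_density(x), x)

# Analytical score of a Gaussian: d/dx log p(x) = -(x - mu) / sigma^2
score_exact = -(x - mu) / sigma**2

max_error = np.max(np.abs(score_numeric - score_exact))
print(f"max abs error: {max_error:.2e}")
```

The normalization constant only adds a term to the log-density that does not depend on $x$, so it has no effect on either score.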
The Big Idea
Score matching was first proposed by Hyvärinen in 2005 (Hyvärinen 2005) as a method to estimate model parameters without computing the partition function $Z$.
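Hyvärinen suggested directly matching the score of the model to the score of the data by minimizing the expected squared difference between them:
$$
\min_\theta \; \frac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\left[\left\| s_\theta(x) - s_{\text{data}}(x) \right\|^2\right]
$$
where $s_\theta(x) = \nabla_x \log p_\theta(x)$ is the score of the model and $s_{\text{data}}(x) = \nabla_x \log p_{\text{data}}(x)$ is the score of the data distribution.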
Intuitive Explanation
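- If the gradients of two potential functions are equal, the functions themselves differ by at most a constant:
  $$
  \nabla_x \phi_1(x) = \nabla_x \phi_2(x) \;\Longrightarrow\; \phi_1(x) = \phi_2(x) + c
  $$
  where $c$ is a constant that does not depend on $x$.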
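- Normalization by the partition function $Z$ does not affect the relation between the potentials. Writing each density as $p_i(x) = e^{-\phi_i(x)} / Z_i$ with $Z_i = \int e^{-\phi_i(x)}\,dx$, a constant shift $\phi_1 = \phi_2 + c$ gives
  $$
  Z_1 = \int e^{-\phi_2(x) - c}\,dx = e^{-c} Z_2, \qquad p_1(x) = \frac{e^{-\phi_2(x) - c}}{e^{-c} Z_2} = p_2(x)
  $$
  The partition function absorbs the constant: the two partition functions differ only by the constant factor $e^{-c}$, so the relation between the two functions is preserved and the normalized densities coincide.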
If we can match the score, then we indirectly match the probability distributions without needing to first compute the partition function. This is the idea behind score matching.
Visualization of Score Matching and Potentials
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set the Seaborn style for modern-looking plots
sns.set(style="whitegrid", context="talk")

# Define the potential functions
def potential_model(x, shift=0):
    """Model potential with an optional scalar shift."""
    return 0.5 * x**2 + shift  # Quadratic potential

def potential_data(x):
    """Data potential."""
    return 0.5 * x**2  # Quadratic potential with no shift

# Analytical gradients (scores)
def gradient_potential_model(x):
    """Gradient of the model potential with respect to x."""
    return x

def gradient_potential_data(x):
    """Gradient of the data potential with respect to x."""
    return x

# Define the range of x values
x = np.linspace(-3, 3, 500)

# Compute potentials
model_potential = potential_model(x, shift=1)
data_potential = potential_data(x)

# Compute gradients (scores)
model_gradient = gradient_potential_model(x)
data_gradient = gradient_potential_data(x)

# Compute partition functions for normalization
dx = x[1] - x[0]  # Differential element for integration
Z_model = np.sum(np.exp(-model_potential)) * dx
Z_data = np.sum(np.exp(-data_potential)) * dx

# Compute normalized probability densities
normalized_model = np.exp(-model_potential) / Z_model
normalized_data = np.exp(-data_potential) / Z_data

# Create DataFrames for plotting with Seaborn
df_potentials = pd.DataFrame({
    'x': np.tile(x, 2),
    'Potential': np.concatenate([model_potential, data_potential]),
    'Type': ['Model Potential (shift=1)'] * len(x) + ['Data Potential (no shift)'] * len(x)
})

df_gradients = pd.DataFrame({
    'x': np.tile(x, 2),
    'Gradient': np.concatenate([model_gradient, data_gradient]),
    'Type': ['Gradient of Model Potential'] * len(x) + ['Gradient of Data Potential'] * len(x)
})

df_normalized = pd.DataFrame({
    'x': np.tile(x, 2),
    'Probability Density': np.concatenate([normalized_model, normalized_data]),
    'Type': ['Normalized Model Potential'] * len(x) + ['Normalized Data Potential'] * len(x)
})

figsize = (5, 3)

# Define custom dash patterns: '' is a solid line, (5, 5) is a dashed line
dash_styles = {
    'Model Potential (shift=1)': '',
    'Data Potential (no shift)': (5, 5),
    'Gradient of Model Potential': '',
    'Gradient of Data Potential': (5, 5),
    'Normalized Model Potential': '',
    'Normalized Data Potential': (5, 5)
}

# Plot potentials
plt.figure(figsize=figsize)
sns.lineplot(
    data=df_potentials,
    x='x',
    y='Potential',
    hue='Type',
    style='Type',
    dashes=dash_styles,
    palette='deep'
)
plt.xlabel('x')
plt.ylabel('Potential Energy')
plt.legend(title='')
plt.show()

# Plot gradients (scores)
plt.figure(figsize=figsize)
sns.lineplot(
    data=df_gradients,
    x='x',
    y='Gradient',
    hue='Type',
    style='Type',
    dashes=dash_styles,
    palette='muted'
)
plt.xlabel('x')
plt.ylabel('Gradient')
plt.legend(title='')
plt.show()

# Plot normalized probability densities
plt.figure(figsize=figsize)
sns.lineplot(
    data=df_normalized,
    x='x',
    y='Probability Density',
    hue='Type',
    style='Type',
    dashes=dash_styles,
    palette='bright'
)
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend(title='')
plt.show()
Eliminating True Score
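The score for the data, $s_{\text{data}}(x) = \nabla_x \log p_{\text{data}}(x)$, appears in the objective. But it is not known, since only samples from $p_{\text{data}}$ are available. Expanding the squared difference, with $s_\theta(x) = \nabla_x \log p_\theta(x)$ the model score, isolates it in a single cross term:
$$
\frac{1}{2}\,\mathbb{E}_{p_{\text{data}}}\left[\| s_\theta(x) - s_{\text{data}}(x) \|^2\right] = \mathbb{E}_{p_{\text{data}}}\left[\frac{1}{2}\| s_\theta(x) \|^2\right] - \mathbb{E}_{p_{\text{data}}}\left[s_\theta(x)^\top s_{\text{data}}(x)\right] + \mathbb{E}_{p_{\text{data}}}\left[\frac{1}{2}\| s_{\text{data}}(x) \|^2\right]
$$
The last term does not depend on $\theta$ and can be dropped from the minimization.
To eliminate $s_{\text{data}}$ from the cross term, use $p_{\text{data}}(x)\,\nabla_x \log p_{\text{data}}(x) = \nabla_x p_{\text{data}}(x)$:
$$
\mathbb{E}_{p_{\text{data}}}\left[s_\theta(x)^\top s_{\text{data}}(x)\right] = \int s_\theta(x)^\top \nabla_x p_{\text{data}}(x)\,dx
$$
Integration by parts allows the gradient to be moved from the density onto the score term. If we assume that the probability of the true distribution goes to zero as $\|x\| \to \infty$, the boundary term vanishes:
$$
\int s_\theta(x)^\top \nabla_x p_{\text{data}}(x)\,dx = -\int p_{\text{data}}(x)\,\nabla_x \cdot s_\theta(x)\,dx = -\mathbb{E}_{p_{\text{data}}}\left[\nabla_x \cdot s_\theta(x)\right]
$$
Back to the minimization problem, this term can be substituted back in:
$$
\min_\theta \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\frac{1}{2}\| s_\theta(x) \|^2 + \nabla_x \cdot s_\theta(x)\right]
$$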
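The integration-by-parts step can be checked numerically. The sketch below assumes a one-dimensional standard normal data distribution and a hypothetical model score $-\tanh(x)$ (both illustrative choices). It compares the cross term, which needs the data score, against the negated expected derivative of the model score, which needs samples only.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)  # data ~ N(0, 1), an assumed toy distribution

def score_data(x):
    """Score of N(0, 1): d/dx log p_data(x) = -x (known only because the toy is analytic)."""
    return -x

def score_model(x):
    """Hypothetical model score, chosen only to be smooth and nonlinear."""
    return -np.tanh(x)

def score_model_derivative(x):
    """Derivative of the model score: d/dx (-tanh x) = -(1 - tanh(x)^2)."""
    return -(1.0 - np.tanh(x) ** 2)

# Cross term E[s_model * s_data]: requires the true data score
cross_term = np.mean(score_model(samples) * score_data(samples))

# Replacement term -E[d/dx s_model]: requires samples only
divergence_term = -np.mean(score_model_derivative(samples))

print(f"cross term:      {cross_term:.4f}")
print(f"divergence term: {divergence_term:.4f}")
```

The two Monte Carlo estimates should agree up to sampling noise, which is the property that lets the true score be dropped from the objective.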
Evaluation of the Objective
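The minimization objective no longer requires the true score, which has been eliminated. The divergence of the model score, $\nabla_x \cdot s_\theta(x)$ with $s_\theta(x) = \nabla_x \log p_\theta(x)$, is the trace of the Hessian of $\log p_\theta(x)$, that is, the Laplacian of the model log-density.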
This component places a penalty on positive curvature in the log-density, preferring highly negative curvature in regions of high probability density, similar to a Gaussian or peaked distribution (with $p \propto e^{-\phi}$, as in the code above, this corresponds to positive curvature of the potential $\phi$). It can be thought of as a regularization term that prefers peaked distributions.
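The squared-norm term $\frac{1}{2}\| s_\theta(x) \|^2$ penalizes large score magnitudes at the data points, which favors models whose log-density is nearly stationary where the data lie.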
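In practice, we can estimate the true expectation over all $x \sim p_{\text{data}}$ with an empirical average over $N$ samples $x_1, \dots, x_N$:
$$
\hat{J}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{2}\| s_\theta(x_i) \|^2 + \nabla_x \cdot s_\theta(x_i)\right]
$$
This can be applied to all samples, or to batches of samples, to get a gradient estimate for the minimization objective.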
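A minimal sketch of the empirical objective, assuming a one-dimensional Gaussian model with parameters `mu` and `sigma` (an illustrative choice, for which the score and its derivative are available in closed form). The objective is the sample average of $\frac{1}{2} s_\theta(x)^2 + \frac{d}{dx} s_\theta(x)$, evaluated on samples only. A coarse grid search stands in for a gradient-based optimizer.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=0.7, size=5_000)  # toy dataset (assumed)

def score_matching_loss(x, mu, sigma):
    """Sample average of 0.5 * s(x)^2 + ds/dx for a Gaussian model N(mu, sigma^2)."""
    score = -(x - mu) / sigma**2        # model score
    score_derivative = -1.0 / sigma**2  # divergence of the score in 1D
    return np.mean(0.5 * score**2 + score_derivative)

# Coarse grid search over the two parameters (illustrative, not efficient)
mus = np.linspace(1.0, 3.0, 101)
sigmas = np.linspace(0.3, 1.5, 121)
losses = np.array([[score_matching_loss(data, m, s) for s in sigmas] for m in mus])
i, j = np.unravel_index(np.argmin(losses), losses.shape)
mu_hat, sigma_hat = mus[i], sigmas[j]

print(f"score matching: mu={mu_hat:.3f}, sigma={sigma_hat:.3f}")
print(f"sample moments: mu={data.mean():.3f}, sigma={data.std():.3f}")
```

For this model the minimizer coincides with the sample mean and standard deviation, so the grid result should match them up to the grid spacing. No partition function is evaluated at any point.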
Computing the Laplacian
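The Laplacian of the potential function $\phi$ is the trace of its Hessian:
$$
\Delta \phi(x) = \operatorname{tr}\left(\nabla_x^2 \phi(x)\right) = \sum_{i} \frac{\partial^2 \phi}{\partial x_i^2}
$$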
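The trace of a matrix such as a Hessian, which is a linear operator, can be estimated without knowing the explicit matrix. For example, the Hessian $H$ may be too large to store in memory, or it may not be known explicitly while the operation $v \mapsto Hv$ (a Hessian-vector product) can still be computed.
A process first proposed by Hutchinson (Hutchinson 1990), part of what is known as randomized linear algebra, allows for computing the trace when only the matrix-vector product $Hv$ is available:
$$
\operatorname{tr}(H) = \mathbb{E}_{v \sim \mathcal{N}(0, I)}\left[v^\top H v\right] \approx \frac{1}{K}\sum_{k=1}^{K} v_k^\top H v_k
$$
The standard normal distribution by definition has a covariance matrix that is the identity, $\mathbb{E}[v v^\top] = I$, so $\mathbb{E}[v^\top H v] = \mathbb{E}[\operatorname{tr}(H v v^\top)] = \operatorname{tr}(H\,\mathbb{E}[v v^\top]) = \operatorname{tr}(H)$.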
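The trace estimate can be sketched in a few lines. The example below assumes a small random symmetric matrix standing in for a Hessian, so that the exact trace is available for comparison. The estimator itself only calls the hypothetical helper `matvec` and never reads the matrix entries.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

# Symmetric test matrix standing in for a Hessian (assumed toy example)
B = rng.standard_normal((n, n))
H = B @ B.T / n

def matvec(v):
    """The only access to H that the estimator uses: a matrix-vector product."""
    return H @ v

def hutchinson_trace(matvec, dim, num_probes, rng):
    """Estimate tr(H) as the average of v^T (H v) over Gaussian probe vectors v."""
    total = 0.0
    for _ in range(num_probes):
        v = rng.standard_normal(dim)
        total += v @ matvec(v)
    return total / num_probes

estimate = hutchinson_trace(matvec, n, num_probes=2_000, rng=rng)
exact = np.trace(H)
print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```

The estimate is unbiased, and its error shrinks roughly as one over the square root of the number of probes. In a score matching setting the matrix-vector product would typically come from automatic differentiation rather than an explicit matrix.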
Applications of Learned Score
MAP Estimation
Now that we have a method to learn the score of a distribution, it can be used as the regularization term of the gradient in MAP estimation. In Lecture 10, the score was used as part of the MAP minimization gradient.
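So using many empirically drawn samples from the prior distribution, the score can be learned by score matching and then substituted for the gradient of the log-prior in the MAP gradient.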
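A minimal sketch of this use of the score, under several assumptions that are not taken from the lecture: a linear forward operator `A`, Gaussian noise with standard deviation `noise_std`, and a standard normal prior whose exact score $-x$ stands in for a learned score. The MAP gradient is the data misfit gradient minus the prior score.

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, n_dim = 20, 5
A = rng.standard_normal((n_obs, n_dim))  # hypothetical forward operator
x_true = rng.standard_normal(n_dim)
noise_std = 0.1
b = A @ x_true + noise_std * rng.standard_normal(n_obs)

def prior_score(x):
    """Stand-in for a learned score; here the exact score of N(0, I)."""
    return -x

def map_gradient(x):
    """Gradient of the negative log-posterior: misfit gradient minus prior score."""
    return A.T @ (A @ x - b) / noise_std**2 - prior_score(x)

# Plain gradient descent; the step size is the inverse of the largest curvature
lipschitz = np.linalg.norm(A, 2) ** 2 / noise_std**2 + 1.0
x = np.zeros(n_dim)
for _ in range(5_000):
    x = x - map_gradient(x) / lipschitz

# Closed-form MAP solution for this Gaussian toy problem, used only as a check
x_closed = np.linalg.solve(A.T @ A / noise_std**2 + np.eye(n_dim),
                           A.T @ b / noise_std**2)
max_difference = np.max(np.abs(x - x_closed))
print(f"max difference from closed form: {max_difference:.2e}")
```

Only `prior_score` would change if a learned score replaced the analytic one; the rest of the iteration is unaffected.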
Diffusion Models and Homotopy
In Lecture 7, the concept of a homotopy between two functions was introduced. A homotopy provides a continuous path between a smoothing function and the target function.
Proposition: Convolution of Random Variables
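The probability density of the sum of two independent random variables is the convolution of their densities. If $X \sim p_X$ and $Y \sim p_Y$ are independent and $Z = X + Y$, then
$$
p_Z(z) = (p_X * p_Y)(z) = \int p_X(x)\, p_Y(z - x)\,dx
$$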
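The proposition can be checked on a case with a known answer. The sketch below assumes two zero-mean Gaussians with illustrative standard deviations, convolves their densities on a grid, and compares the result with the Gaussian whose variance is the sum of the two variances.

```python
import numpy as np

def gaussian_pdf(x, std):
    """Density of a zero-mean Gaussian with standard deviation std."""
    return np.exp(-0.5 * (x / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Symmetric grid with an odd number of points so that mode="same" stays centered
x = np.linspace(-12.0, 12.0, 2401)
dx = x[1] - x[0]

std_x, std_y = 1.0, 2.0  # illustrative standard deviations (assumed)
p_x = gaussian_pdf(x, std_x)
p_y = gaussian_pdf(x, std_y)

# Discrete approximation of the convolution integral (p_x * p_y)(z)
p_sum_numeric = np.convolve(p_x, p_y, mode="same") * dx

# Known result: independent zero-mean Gaussians add their variances
p_sum_exact = gaussian_pdf(x, np.sqrt(std_x**2 + std_y**2))

max_error = np.max(np.abs(p_sum_numeric - p_sum_exact))
print(f"max abs error: {max_error:.2e}")
```

Convolving with a Gaussian widens and smooths a density, which is why adding Gaussian noise to data can be read as smoothing the data distribution.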
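For a dataset of samples $x \sim p_{\text{data}}$ and independent Gaussian noise $z \sim \mathcal{N}(0, I)$, a time-indexed random variable $x_t$ can be formed as a weighted sum of $x$ and $z$, with the weights set by a time schedule.
The homotopy proceeds slightly differently with the time scheduling, but it still begins and ends with the starting and target distributions. By the proposition above, the density of this new random variable $x_t$ is a convolution of a rescaled data density with a Gaussian, that is, a smoothed version of the data distribution.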
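A rough sketch of such a path between distributions, under a loudly stated assumption: the schedule $x_t = \sqrt{1 - t}\,x + \sqrt{t}\,z$ with $t \in [0, 1]$ is a hypothetical choice made for illustration and is not necessarily the schedule used in the lecture. The toy bimodal dataset is also assumed.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy bimodal dataset (assumed for illustration)
data = np.concatenate([rng.normal(-2.0, 0.3, 5_000), rng.normal(2.0, 0.3, 5_000)])
noise = rng.standard_normal(data.shape)

def noisy_variable(t):
    """Hypothetical schedule: x_t = sqrt(1 - t) * data + sqrt(t) * noise, t in [0, 1]."""
    return np.sqrt(1.0 - t) * data + np.sqrt(t) * noise

for t in (0.0, 0.5, 1.0):
    x_t = noisy_variable(t)
    print(f"t={t:.1f}: mean={x_t.mean():+.3f}, std={x_t.std():.3f}")

x_at_zero = noisy_variable(0.0)  # endpoint that equals the data samples
x_at_one = noisy_variable(1.0)   # endpoint that equals the Gaussian noise samples
```

At one endpoint the samples are the data and at the other they are pure Gaussian noise, with smoothed mixtures of the two in between.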