Metropolis-Hastings algorithm - a Python primer
Posted on 18/04/2017 in posts: machine learning, implementation, bayesian
%matplotlib inline
from ipywidgets import interact
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def plot_distributions(samples, thetas, true_rv, prior_rv):
x = np.arange(samples.min()-1,samples.max()+1,0.01)
fig,ax = plt.subplots(1,2, figsize=(8,4))
ax[0].plot(thetas, prior_rv.pdf(thetas), color="b", label=r'Prior')
ax[0].axvline(x=true_rv.mean(), color="k", linestyle="--", lw=1, label="True mean")
ax[1].plot(x, true_rv.pdf(x), color="r", label=r'True')
ax[1].hist(samples, weights=np.ones_like(samples)/samples.shape[0], alpha=0.5, label=r'Samples')
ax[0].legend()
ax[1].legend()
ax[0].set_xlabel("theta")
ax[1].set_xlabel("x")
ax[0].set_title("Dist theta")
ax[1].set_title("Dist data")
fig.tight_layout()
return fig, ax
def construct_ll_func(samples, *rv_fixed_args, rv_class=stats.norm):
def log_likelihood(*rv_params):
return np.sum(rv_class(*rv_params, *rv_fixed_args).logpdf(samples), axis=-1)
return log_likelihood
Inspiration: https://www.youtube.com/watch?v=4gNpgSPal_8
Notes
- Metropolis-Hastings - sample the next location from a proposal distribution centred at the current location, then move (or stay) according to the MH acceptance rule.
- Use a Gaussian as the proposal distribution and show the movement for arbitrary target distributions.
- Gibbs sampling - move along one dimension at a time, conditional on the current values of all the other dimensions.
Problem definition
Let us say we are given $n$ samples $X$ generated from a true distribution $p(X\mid\theta)$ parametrized by variables $\theta$. We wish to know the probability distribution over all possible $\theta$s that could have generated the data, $p(\theta \mid X)$. We start with some prior distribution $\psi(\theta)$.
Of course, $p(\theta \mid X)$ and $p(X\mid\theta)$ are related by the following equation:
$$ \begin{equation} p(\theta \mid X) = \frac{p(X\mid\theta)\psi(\theta)}{Z}\\ Z = \int_{\theta}{p(X\mid\theta)\psi(\theta)\,d{\theta}} \end{equation} $$

If $p(\theta \mid X)$ doesn't have an analytic form, then calculating it is hard because of the integral involved in computing $Z$. Hence, we use Monte Carlo methods to draw samples from $p(\theta \mid X)$ and approximate it.
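For intuition, when $\theta$ is one-dimensional we can still approximate $Z$ numerically on a grid. Below is a minimal sketch (with toy numbers mirroring the setup used later in this post, not part of the actual workflow); in higher dimensions this grid integration becomes infeasible, which is exactly when sampling methods pay off.
# A rough 1D illustration (toy numbers): approximate Z by numerical integration
# over a theta grid. This only works because theta is one-dimensional.
toy_data = stats.norm(-3, 2).rvs(10)                        # pretend observed samples X
theta_grid = np.arange(-10, 10, 0.01)
log_lik = np.array([stats.norm(t, 2).logpdf(toy_data).sum() for t in theta_grid])
unnorm_post = np.exp(log_lik) * stats.norm(0, 1).pdf(theta_grid)  # p(X|theta) * psi(theta)
Z = np.trapz(unnorm_post, theta_grid)                       # numerical approximation of Z
posterior_on_grid = unnorm_post / Z                         # p(theta|X) evaluated on the grid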
N=10
true_mean, true_std = -3, 2
true_rv = stats.norm(true_mean, true_std)
samples = true_rv.rvs(N)
prior_rv = stats.norm(0, 1)
thetas_prior = np.arange(-5,5, 0.01)
plot_distributions(samples, thetas_prior, true_rv, prior_rv)
Sampling from the Markov chain
We want to ensure that we sample $\theta$'s such that their distribution converges to the true distribution over $\theta$s. This distribution will help us quantify our uncertainty in the values of $\theta$. So, if the number of data samples in $X$ is small, the uncertainty should be high; as more data comes in, the uncertainty should shrink.
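For the Gaussian model used here, this shrinking-uncertainty behaviour can be checked against the closed-form conjugate posterior. The following is just a reference sketch (assuming a $\mathcal{N}(0, 1)$ prior and the known likelihood standard deviation), not part of the sampling machinery:
# Conjugate Normal-Normal result: with prior N(0, tau^2) and n i.i.d. samples of
# known std sigma, the posterior precision over the mean is n/sigma^2 + 1/tau^2.
sigma, tau = true_std, 1.0
for n in [1, 10, 100, 1000]:
    posterior_std = 1.0 / np.sqrt(n / sigma**2 + 1 / tau**2)
    print(n, posterior_std)   # uncertainty in theta shrinks as n grows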
current_theta = prior_rv.rvs()
current_rv = stats.norm(current_theta, true_std)
ll = np.sum(current_rv.logpdf(samples)) # Compute the log likelihood by hand
pi_theta = np.exp(ll)*prior_rv.pdf(current_theta)
print(current_theta, pi_theta, ll)
fig, ax = plot_distributions(samples, thetas_prior, true_rv, prior_rv)
ax[0].axvline(x=current_theta, color="r", linestyle="--", lw=1)
Getting the maximum likelihood estimate
The likelihood of the data $X$ given the parameter $\theta$ equals $p(X \mid \theta)$. If we assume $X$ consists of independent and identically distributed (i.i.d.) samples from a Gaussian distribution with $\theta$ as its mean and some fixed standard deviation, then $p(X \mid \theta) = \prod_{x \in X}p(x \mid \theta)$. In most cases we are interested in finding the value of $\theta$ which maximizes the likelihood of the data. For our current samples, the figure below shows how the likelihood changes as $\theta$ changes.
log_likelihood = construct_ll_func(samples, true_std)
thetas = np.arange(-10,10,0.01)
plt.plot(thetas, [log_likelihood(theta) for theta in thetas], label="Likelihood only")
lls = [log_likelihood(theta)+prior_rv.logpdf(theta)
for theta in thetas]
plt.plot(thetas, lls, color="r", label="With prior")
plt.axvline(x=true_mean, color="k", linestyle="--", lw=1)
plt.axvline(x=thetas[np.argmax(lls)], color="r", linestyle="--", lw=1)
plt.xlabel("theta")
plt.ylabel("Log Likelihood")
plt.legend()
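As a quick sanity check (a small sketch reusing the objects defined above), the maximizer of the likelihood alone should land close to the sample mean, which is the closed-form MLE for the mean of a Gaussian with known standard deviation:
# Grid-based MLE vs the analytical MLE (the sample mean)
lls_no_prior = np.array([log_likelihood(theta) for theta in thetas])
print(thetas[np.argmax(lls_no_prior)], samples.mean())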
Finding the posterior distribution of $\theta$ given $X$
However, if we are interested in quantifying the uncertainty in the estimate of $\theta$ calculated from the data $X$, based on some prior belief about the distribution of $\theta$, then we need to find the full posterior distribution $p(\theta \mid X)$. MCMC methods help in this regard by allowing us to sample from $p(\theta \mid X)$ without computing $Z$.
The Metropolis-Hastings (MH) algorithm is a simple way to get samples from this distribution. If we define a new quantity $\pi(\theta) = p(X \mid \theta)\psi(\theta)$, then the MH algorithm constructs a transition probability distribution $p(\theta_{i+1} | \theta_{i})$ that satisfies the detailed balance condition below. Samples $\theta_{0}, \theta_{1}, \dots$ drawn from this Markov chain will then resemble samples from the true posterior, provided $p(\theta_{i+1} | \theta_{i})$ also makes the chain ergodic:
$$ \begin{equation} \pi(\theta_{i})p(\theta_{i+1} | \theta_{i}) = \pi(\theta_{i+1})p(\theta_{i} | \theta_{i+1}) \end{equation} $$

The required $p(\theta_{i+1} | \theta_{i})$ can be constructed by the following method:
- Use a proposal distribution $q(\theta_{i+1} | \theta_{i})$, e.g. $\mathcal{N}(\theta_{i}, 1)$, a Gaussian distribution centred at the current value $\theta_{i}$.
- Now, define a probability of selecting $\theta_{i+1}$ as follows: $$ \begin{equation} \alpha(\theta_{i+1}, \theta_{i}) = min(1, \frac{\pi(\theta_{i+1})q(\theta_{i} | \theta_{i+1})}{\pi(\theta_{i})q(\theta_{i+1} | \theta_{i})}) \end{equation} $$
- Now, define $p(\theta_{i+1} | \theta_{i}) = \alpha(\theta_{i+1}, \theta_{i})q(\theta_{i+1} | \theta_{i})$. Note that for a symmetric proposal like the Gaussian above, $q(\theta_{i} | \theta_{i+1}) = q(\theta_{i+1} | \theta_{i})$, so the acceptance ratio reduces to $\pi(\theta_{i+1})/\pi(\theta_{i})$; the implementation below therefore works with log likelihood and log prior differences for numerical stability.
def get_alpha(theta_curr, theta_next, log_likelihood, log_prior):
    # Acceptance probability based on pi(theta_next)/pi(theta_curr); the symmetric
    # Gaussian proposal cancels, leaving only likelihood and prior terms.
    ll_diff = log_likelihood(theta_next) - log_likelihood(theta_curr)
    prior_diff = log_prior(theta_next) - log_prior(theta_curr)
    # min(1, ratio) computed in log space for numerical stability
    return np.exp(np.min([0, ll_diff + prior_diff]))
theta_diffs = np.arange(-0.01,0.01,0.0001)
alpha = [
get_alpha(current_theta, current_theta + theta_diff, log_likelihood, prior_rv.logpdf)
for theta_diff in theta_diffs
]
plt.plot(current_theta + theta_diffs, alpha)
plt.axvline(x=current_theta, color="k", linestyle="--", lw=1)
plt.xlabel("Theta next")
plt.ylabel("alpha(theta_next, theta_curr)")
thetas = [current_theta]
n_samples = 10000
for i in range(n_samples):
    # Propose a new theta from a Gaussian centred at the current value
    next_theta = thetas[-1] + np.random.randn()
    alpha = get_alpha(thetas[-1], next_theta, log_likelihood, prior_rv.logpdf)
    # Accept with probability alpha; otherwise keep the current theta
    next_theta = [thetas[-1], next_theta][stats.bernoulli.rvs(alpha)]
    thetas.append(next_theta)
plt.plot(thetas)
plt.axhline(y=np.mean(thetas[500:]), color="k", linestyle="--", lw=1)
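One crude diagnostic worth looking at (a sketch added here, not something the loop above tracks explicitly) is the fraction of iterations in which the chain actually moved; acceptance rates very close to 0 or 1 usually indicate a poorly scaled proposal:
# Fraction of steps where the sampled value changed, i.e. the proposal was accepted
theta_chain = np.array(thetas)
accept_rate = np.mean(theta_chain[1:] != theta_chain[:-1])
print(accept_rate)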
fig, ax = plot_distributions(samples, thetas_prior, true_rv, prior_rv)
ax[0].axvline(x=np.mean(thetas[500:]), color="r", linestyle="--", lw=1)
ax[0].hist(thetas[500:], weights=np.ones_like(thetas[500:])/len(thetas[500:]), color="r", alpha=0.5)
posterior_samples = stats.norm(
    np.mean(thetas[500:]), true_std   # posterior mean of theta, known std of the likelihood
).rvs(1000)
ax[1].hist(posterior_samples,
weights=np.ones_like(posterior_samples)/posterior_samples.shape[0],
color="r", alpha=0.3)
2D case
mean_vector = np.array([0.5, -0.2])
cov_matrix = np.array([[0.5, 0.3], [0.3, 0.5]])
true_rv_2d = stats.multivariate_normal(mean_vector, cov_matrix)
prior_rv_2d = stats.multivariate_normal(np.zeros_like(mean_vector), np.eye(mean_vector.shape[0]))
samples_2d = true_rv_2d.rvs(100)
thetas_prior_2d = np.array([[-1,-1], [1,1]])
def get_2d_dist_vals(x_range, y_range, step=0.01):
x, y = np.mgrid[
x_range[0]:x_range[1]:step,
y_range[0]:y_range[1]:step,
]
pos = np.empty(x.shape + (2,))
pos[:, :, 0] = x; pos[:, :, 1] = y
return x, y, pos
def plot_distributions_2d(samples, thetas_prior, true_rv, prior_rv):
step = 0.01
fig,ax = plt.subplots(1,2, figsize=(8,4))
## Plot theta distribution
mean_vector = true_rv.mean
x_range = [thetas_prior[:, 0].min()-1*step, thetas_prior[:, 0].max()+1*step]
y_range = [thetas_prior[:, 1].min()-1*step, thetas_prior[:, 1].max()+1*step]
x, y, pos = get_2d_dist_vals(x_range, y_range, step=step)
ax[0].contourf(x, y, prior_rv.pdf(pos), alpha=0.3, cmap="Blues", label="Prior")
ax[0].axvline(x=mean_vector[0], linestyle="--", color="r", alpha=0.5)
ax[0].axhline(y=mean_vector[1], linestyle="--", color="r", alpha=0.5)
ax[0].set_title("Dist theta")
## Plot data distribution
x_range = [samples[:, 0].min()-1*step, samples[:, 0].max()+1*step]
y_range = [samples[:, 1].min()-1*step, samples[:, 1].max()+1*step]
x, y, pos = get_2d_dist_vals(x_range, y_range, step=step)
ax[1].contourf(x, y, true_rv.pdf(pos), alpha=0.7, cmap="Reds", label="True")
## Plot samples
ax[1].plot(samples[:, 0], samples[:, 1], marker="x", linestyle="none", color="k", label="Samples")
ax[1].set_title("Dist data")
fig.tight_layout()
return fig, ax
plot_distributions_2d(samples_2d, thetas_prior_2d, true_rv_2d, prior_rv_2d)
log_likelihood_2d = construct_ll_func(samples_2d, cov_matrix, rv_class=stats.multivariate_normal)
current_theta_2d = prior_rv_2d.rvs()
thetas = [current_theta_2d]
diff_rv = stats.multivariate_normal(np.zeros_like(mean_vector), np.eye(mean_vector.shape[0]))
diff_theta = diff_rv.rvs()
next_theta = thetas[-1] + diff_theta
alpha = get_alpha(thetas[-1], next_theta, log_likelihood_2d, prior_rv_2d.logpdf)
print(thetas[-1], next_theta, alpha, diff_theta)
next_theta = [thetas[-1], next_theta][stats.bernoulli.rvs(alpha)]
thetas.append(next_theta)
plt.plot(*zip(*thetas[:-2]), lw=1, marker="x")
plt.plot(thetas[-2][0], thetas[-2][1], marker="x", color="r")
plt.plot(thetas[-1][0], thetas[-1][1], marker="x", color="k")
plt.axvline(x=mean_vector[0], linestyle="--", color="r", alpha=0.5)
plt.axhline(y=mean_vector[1], linestyle="--", color="r", alpha=0.5)
current_theta_2d = prior_rv_2d.rvs()
thetas = [current_theta_2d]
diff_rv = stats.multivariate_normal(np.zeros_like(mean_vector), np.eye(mean_vector.shape[0]))
n_samples=10000
for i in range(n_samples):
next_theta = thetas[-1] + diff_rv.rvs()
alpha = get_alpha(thetas[-1], next_theta, log_likelihood_2d, prior_rv_2d.logpdf)
next_theta = [thetas[-1], next_theta][stats.bernoulli.rvs(alpha)]
thetas.append(next_theta)
thetas = np.array(thetas)
thetas.shape
fig, ax = plot_distributions_2d(samples_2d, thetas_prior_2d, true_rv_2d, prior_rv_2d)
hb = ax[0].hexbin(thetas[500:, 0], thetas[500:, 1], gridsize=50, cmap='Reds', alpha=0.3)
cb = fig.colorbar(hb, ax=ax[0])
cb.set_label('counts')
burn_in=500
posterior_mean = thetas[burn_in:].mean(axis=0)
fig = plt.figure(figsize=(8,4))
ax = plt.subplot2grid((2,2), (0,0))
ax.plot(thetas[burn_in:, 0], lw=1)
ax.axhline(y=posterior_mean[0], lw=1, linestyle="--", color="k")
ax.axhline(y=mean_vector[0], lw=1, linestyle="--", color="r")
ax.set_xlabel("theta 0")
ax = plt.subplot2grid((2,2), (1,0))
ax.plot(thetas[burn_in:, 1], lw=1)
ax.axhline(y=posterior_mean[1], lw=1, linestyle="--", color="k")
ax.axhline(y=mean_vector[1], lw=1, linestyle="--", color="r")
ax.set_xlabel("theta 1")
ax = plt.subplot2grid((2,2), (0,1), rowspan=2)
ax.plot(thetas[:, 0], thetas[:, 1], lw=1, linestyle="--", marker="x", ms=5)
ax.plot(posterior_mean[0], posterior_mean[1], marker="o", color="k", ms=10)
ax.axvline(x=mean_vector[0], linestyle="--", color="r", alpha=0.5)
ax.axhline(y=mean_vector[1], linestyle="--", color="r", alpha=0.5)
ax.set_xlabel("theta 0")
ax.set_ylabel("theta 1")
fig.tight_layout()
thetas.mean(axis=0), mean_vector
step=0.01
fig = plt.figure(figsize=(8,4))
posterior_mean = thetas[500:].mean(axis=0)
posterior_mean_rv = stats.multivariate_normal(posterior_mean, true_rv_2d.cov)
ax = plt.subplot2grid((2,2), (0,0))
ax.hist(thetas[500:, 0], alpha=0.5)
ax.axvline(x=posterior_mean[0], linestyle="--", color="b", alpha=0.5)
ax.axvline(x=mean_vector[0], linestyle="--", color="r", alpha=0.5)
ax.set_xlabel("theta 0")
ax = plt.subplot2grid((2,2), (1,0))
ax.hist(thetas[500:, 1], alpha=0.5)
ax.axvline(x=posterior_mean[1], linestyle="--", color="b", alpha=0.5)
ax.axvline(x=mean_vector[1], linestyle="--", color="r", alpha=0.5)
ax.set_xlabel("theta 1")
ax = plt.subplot2grid((2,2), (0,1), rowspan=2)
x_range = [samples_2d[:, 0].min()-1*step, samples_2d[:, 0].max()+1*step]
y_range = [samples_2d[:, 1].min()-1*step, samples_2d[:, 1].max()+1*step]
x, y, pos = get_2d_dist_vals(x_range, y_range)
ax.contourf(x, y, true_rv_2d.pdf(pos), alpha=0.3, cmap="Reds", label="True")
ax.axvline(x=mean_vector[0], linestyle="--", color="r", alpha=0.5, label="True")
ax.axhline(y=mean_vector[1], linestyle="--", color="r", alpha=0.5)
ax.contourf(x, y, posterior_mean_rv.pdf(pos), alpha=0.3, cmap="Blues", label="Posterior mean")
ax.axvline(x=posterior_mean[0], linestyle="--", color="b", alpha=0.5, label="Posterior mean")
ax.axhline(y=posterior_mean[1], linestyle="--", color="b", alpha=0.5)
ax.set_title("Data dist")
ax.set_xlabel("theta 0")
ax.set_ylabel("theta 1")
ax.legend()
fig.tight_layout()
Gibbs sampling
Gibbs sampling is an extension of MH that explores one parameter at a time, conditional on the current values of all the other parameters.
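As a rough illustration of the idea, the sketch below does coordinate-at-a-time updates by reusing get_alpha. Strictly speaking this is Metropolis-within-Gibbs (a textbook Gibbs sampler would draw each coordinate from its exact full conditional), but it shows the one-dimension-at-a-time movement described above.
# Coordinate-at-a-time sampling sketch: each iteration proposes a move along a
# single dimension of theta and accepts it with the usual MH acceptance ratio.
current = prior_rv_2d.rvs()
gibbs_thetas = [current]
n_samples = 10000
for i in range(n_samples):
    dim = i % 2                                   # cycle over the two coordinates
    proposal = gibbs_thetas[-1].copy()
    proposal[dim] += np.random.randn()            # 1D Gaussian move along one axis
    alpha = get_alpha(gibbs_thetas[-1], proposal, log_likelihood_2d, prior_rv_2d.logpdf)
    gibbs_thetas.append(proposal if stats.bernoulli.rvs(alpha) else gibbs_thetas[-1])
gibbs_thetas = np.array(gibbs_thetas)
print(gibbs_thetas[500:].mean(axis=0), mean_vector)   # should land near the true mean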