Metropolis Hastings algorithm - a python primer

Posted on 18/04/2017 in posts machine learning implementation bayesian

In [1]:
%matplotlib inline
from ipywidgets import interact
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats 
In [2]:
def plot_distributions(samples, thetas, true_rv, prior_rv):
    x = np.arange(samples.min()-1,samples.max()+1,0.01)
    fig,ax = plt.subplots(1,2, figsize=(8,4))
    ax[0].plot(thetas, prior_rv.pdf(thetas), color="b", label=r'Prior')
    ax[0].axvline(x=true_rv.mean(), color="k", linestyle="--", lw=1, label="True mean")
    ax[1].plot(x, true_rv.pdf(x), color="r", label=r'True')
    ax[1].hist(samples, weights=np.ones_like(samples)/samples.shape[0], alpha=0.5, label=r'Samples')
    ax[0].set_title("Dist theta")
    ax[1].set_title("Dist data")
    return fig, ax

def construct_ll_func(samples, *rv_fixed_args, rv_class=stats.norm):
    def log_likelihood(*rv_params):
        return np.sum(rv_class(*rv_params, *rv_fixed_args).logpdf(samples), axis=-1)
    return log_likelihood



  • Metropolis hastings - Sample next location from distribution at the currect location. Move to next location based on the MH equation.
    • Use a gaussian as the distribution and show the movement for arbitrary distributions.
  • Gibbs sampling - Move along one dimension of the location conditional on the full current location.

Problem definition

Let us say we are given $n$ samples $X$ which are supposedly generated from a true distribution $p(X\mid\theta)$ which is parametrized by variables $\theta$. We wish to know the probability distribution over all possible $\theta$s, $p(\theta \mid X)$, which could have generated the data samples $X$. We start with some prior distribution $\psi(\theta)$

Ofcourse, $p(\theta \mid X)$ and $p(X\mid\theta)$ are related by the following equation:

$$ \begin{equation} p(\theta \mid X) = \frac{p(X\mid\theta)\psi(\theta)}{Z}\\ Z = \int_{\theta}{p(X\mid\theta)\psi(\theta)d{\theta}} \end{equation} $$

If $p(\theta \mid X)$ doesn't have an analytic solution, then calculating it is hard because of the integral involved in calculation of $Z$. Hence, we use the monte-carlo methods to samples from $p(\theta \mid X)$ to approximate its value.

In [3]:
true_mean, true_std = -3, 2
true_rv = stats.norm(true_mean, true_std)
samples = true_rv.rvs(N)
prior_rv = stats.norm(0, 1)
thetas_prior = np.arange(-5,5, 0.01)
plot_distributions(samples, thetas_prior, true_rv, prior_rv)
(<matplotlib.figure.Figure at 0x7fdf4116ae80>,
 array([<matplotlib.axes._subplots.AxesSubplot object at 0x7fdf4118a518>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fdf3de57588>], dtype=object))

Sampling from the markov chain

We want to ensure that we sample $\theta$'s such that their distribution converges to the true distribution over $\theta$s. This distribution will help us quantify our uncertainity in the values of $\theta$s. So, if the number of data samples $X$ is too few, then the uncertainity should be higher else uncertainity should reduce.

In [4]:
current_theta = prior_rv.rvs()
current_rv = stats.norm(current_theta, true_std)
ll = np.sum(current_rv.logpdf(samples)) # Manually specify the log likelihood
pi_theta = np.exp(ll)*prior_rv.pdf(current_theta)
print(current_theta, pi_theta, ll)
fig, ax = plot_distributions(samples, thetas_prior, true_rv, prior_rv)
ax[0].axvline(x=current_theta, color="r", linestyle="--", lw=1)
0.820086972857 3.85242779237e-13 -27.3296928092
<matplotlib.lines.Line2D at 0x7fdf3dd3f240>

Getting the maximum likelihood estimate

The likelihood of the data $X$ given the parameter $\theta$ equals $p(X \mid \theta)$. If we assume, X includes independent and identically distributed (i.i.d) samples from a gaussian distribution with $\theta$ as its mean and some fixed std. deviation. Then $p(X \mid \theta) = \prod_{x \in X}p(x \mid \theta)$. In most cases we will be interested in finding the value of $\theta$ which maximized the likelihood of the data. For our current samples, the below figure show how the likelihood changes as the data changes.

In [5]:
log_likelihood = construct_ll_func(samples, true_std)
In [6]:
thetas = np.arange(-10,10,0.01)
plt.plot(thetas, [log_likelihood(theta) for theta in thetas], label="True")
lls = [log_likelihood(theta)+prior_rv.logpdf(theta)
    for theta in thetas]
plt.plot(thetas, lls, color="r", label="With prior")
plt.axvline(x=-3, color="k", linestyle="--", lw=1)
plt.axvline(x=thetas[np.argmax(lls)], color="r", linestyle="--", lw=1)
plt.ylabel("Log Likelihood")
<matplotlib.legend.Legend at 0x7fdf3dc47438>

Finding the posterior distribution of $\theta$ given $X$

However, if we are insterested in quantifying the uncertainity in the estimate of $\theta$ calculated using the data $X$, based on some prior belief of the distribution of $\theta$, the we need to find the full posterior distribution $p(\theta \mid X)$. MCMC methods help in this regard, by allowing us to sample from $p(\theta \mid X)$ without finding the value of $Z$.

Metropolis-Hastings (MH) algorithm is a simple way to get samples from this distribution. If we define a new quantity $\pi(\theta) = p(X \mid \theta)p(\theta)$, then the MH algorithm relies on the following assumption for some given transition probabilty distribution $p(\theta_{i+1} | \theta_{i})$. Then points samples from this Markov chain $\theta_{0}, \theta_{1}, ...$, will resemble the samples from the true distribution, given that $p(\theta_{i+1} | \theta_{i})$ is defined in a way, such that this sequence of samples is ergodic. i.e.,

$$ \begin{equation} \pi(\theta_{i})p(\theta_{i+1} | \theta_{i}) = \pi(\theta_{i+1})p(\theta_{i} | \theta_{i+1}) \end{equation} $$

The required $p(\theta_{i+1} | \theta_{i})$ can be formulated by the following method:

  • Use a proposal distribution $q(\theta_{i+1} | \theta_{i})$. E.g. $\mathcal{N}(\theta_{i}, 1)$, a gaussian distribution around the current value of $\theta_{i}$
  • Now, define a probability of selecting $\theta_{i+1}$ as follows: $$ \begin{equation} \alpha(\theta_{i+1}, \theta_{i}) = min(1, \frac{\pi(\theta_{i+1})q(\theta_{i} | \theta_{i+1})}{\pi(\theta_{i})q(\theta_{i+1} | \theta_{i})}) \end{equation} $$
  • Now, define $p(\theta_{i+1} | \theta_{i}) = \alpha(\theta_{i+1}, \theta_{i})q(\theta_{i+1} | \theta_{i})$
In [7]:
def get_alpha(theta_curr, theta_next, log_likelihood, log_prior):
    ll_diff = log_likelihood(theta_next) - log_likelihood(theta_curr)
    prior_diff = log_prior(theta_next) - log_prior(theta_curr)
    return np.exp(np.min([0, ll_diff + prior_diff]))
In [8]:
theta_diffs = np.arange(-0.01,0.01,0.0001)
alpha = [
    get_alpha(current_theta, current_theta + theta_diff, log_likelihood, prior_rv.logpdf)
    for theta_diff in theta_diffs
plt.plot(current_theta + theta_diffs, alpha)
plt.axvline(x=current_theta, color="k", linestyle="--", lw=1)
plt.xlabel("Theta next")
plt.ylabel("alpha(theta_next, theta_curr)")
<matplotlib.text.Text at 0x7fdf3db43cc0>
In [9]:
thetas = [current_theta]
for i in range(n_samples):
    next_theta = thetas[-1] + np.random.randn()
    alpha = get_alpha(thetas[-1], next_theta, log_likelihood, prior_rv.logpdf)
    next_theta = [thetas[-1], next_theta][stats.bernoulli.rvs(alpha)]
plt.axhline(y=np.mean(thetas[500:]), color="k", linestyle="--", lw=1)
<matplotlib.lines.Line2D at 0x7fdf3dab8748>
In [10]:
fig, ax = plot_distributions(samples, thetas_prior, true_rv, prior_rv)
ax[0].axvline(x=np.mean(thetas[500:]), color="r", linestyle="--", lw=1)
ax[0].hist(thetas[500:], weights=np.ones_like(thetas[500:])/len(thetas[500:]), color="r", alpha=0.5)
posterior_samples = stats.norm(
    np.mean(thetas[500:]), 2

         color="r", alpha=0.3)
(array([ 0.005,  0.018,  0.077,  0.172,  0.225,  0.231,  0.171,  0.068,
         0.022,  0.011]),
 array([-7.74379311, -6.46196701, -5.18014091, -3.89831481, -2.61648871,
        -1.33466261, -0.05283651,  1.22898959,  2.51081569,  3.79264179,
 <a list of 10 Patch objects>)

2D case

In [11]:
mean_vector = np.array([0.5, -0.2])
cov_matrix = np.array([[0.5, 0.3], [0.3, 0.5]])
true_rv_2d = stats.multivariate_normal(mean_vector, cov_matrix)
prior_rv_2d = stats.multivariate_normal(np.zeros_like(mean_vector), np.eye(mean_vector.shape[0]))
samples_2d = true_rv_2d.rvs(100)
thetas_prior_2d = np.array([[-1,-1], [1,1]])
In [12]:
def get_2d_dist_vals(x_range, y_range, step=0.01):
    x, y = np.mgrid[
    pos = np.empty(x.shape + (2,))
    pos[:, :, 0] = x; pos[:, :, 1] = y
    return x, y, pos
def plot_distributions_2d(samples, thetas_prior, true_rv, prior_rv):
    step = 0.01
    fig,ax = plt.subplots(1,2, figsize=(8,4))
    ## Plot theta distribution
    mean_vector = true_rv.mean
    x_range = [thetas_prior[:, 0].min()-1*step, thetas_prior[:, 0].max()+1*step]
    y_range = [thetas_prior[:, 1].min()-1*step, thetas_prior[:, 1].max()+1*step]
    x, y, pos = get_2d_dist_vals(x_range, y_range, step=step)
    ax[0].contourf(x, y, prior_rv.pdf(pos), alpha=0.3, cmap="Blues", label="Prior")
    ax[0].axvline(x=mean_vector[0], linestyle="--", color="r", alpha=0.5)
    ax[0].axhline(y=mean_vector[1], linestyle="--", color="r", alpha=0.5)
    ax[0].set_title("Dist theta")
    ## Plot data distribution
    x_range = [samples[:, 0].min()-1*step, samples[:, 0].max()+1*step]
    y_range = [samples[:, 1].min()-1*step, samples[:, 1].max()+1*step]
    x, y, pos = get_2d_dist_vals(x_range, y_range, step=step)
    ax[1].contourf(x, y, true_rv.pdf(pos), alpha=0.7, cmap="Reds", label="True")
    ## Plot samples
    ax[1].plot(samples[:, 0], samples[:, 1], marker="x", linestyle="none", color="k", label="Samples")
    ax[1].set_title("Dist data")
    return fig, ax
In [13]:
plot_distributions_2d(samples_2d, thetas_prior_2d, true_rv_2d, prior_rv_2d)
(<matplotlib.figure.Figure at 0x7fdf3dbe2fd0>,
 array([<matplotlib.axes._subplots.AxesSubplot object at 0x7fdf3db9d518>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fdf3d8b3710>], dtype=object))
In [14]:
log_likelihood_2d = construct_ll_func(samples_2d, cov_matrix, rv_class=stats.multivariate_normal)
In [15]:
current_theta_2d = prior_rv_2d.rvs()
thetas = [current_theta_2d]
diff_rv = stats.multivariate_normal(np.zeros_like(mean_vector), np.eye(mean_vector.shape[0]))
In [16]:
diff_theta = diff_rv.rvs()
next_theta = thetas[-1] + diff_theta
alpha = get_alpha(thetas[-1], next_theta, log_likelihood_2d, prior_rv_2d.logpdf)
print(thetas[-1], next_theta, alpha, diff_theta)
next_theta = [thetas[-1], next_theta][stats.bernoulli.rvs(alpha)]

plt.plot(*zip(*thetas[:-2]), lw=1, marker="x")
plt.plot(thetas[-2][0], thetas[-2][1], marker="x", color="r")
plt.plot(thetas[-1][0], thetas[-1][1], marker="x", color="k")
plt.axvline(x=mean_vector[0], linestyle="--", color="r", alpha=0.5)
plt.axhline(y=mean_vector[1], linestyle="--", color="r", alpha=0.5)
[ 0.48919652  0.4289051 ] [-0.87582839  1.34949644] 0.0 [-1.36502492  0.92059134]
<matplotlib.lines.Line2D at 0x7fdf3d64ebe0>
In [17]:
current_theta_2d = prior_rv_2d.rvs()
thetas = [current_theta_2d]
diff_rv = stats.multivariate_normal(np.zeros_like(mean_vector), np.eye(mean_vector.shape[0]))

for i in range(n_samples):
    next_theta = thetas[-1] + diff_rv.rvs()
    alpha = get_alpha(thetas[-1], next_theta, log_likelihood_2d, prior_rv_2d.logpdf)
    next_theta = [thetas[-1], next_theta][stats.bernoulli.rvs(alpha)]
In [18]:
thetas = np.array(thetas)
(10001, 2)
In [19]:
fig, ax = plot_distributions_2d(samples_2d, thetas_prior_2d, true_rv_2d, prior_rv_2d)

hb = ax[0].hexbin(thetas[500:, 0], thetas[500:, 1], gridsize=50, cmap='Reds', alpha=0.3)
cb = fig.colorbar(hb, ax=ax[0])
In [20]:
posterior_mean = thetas[burn_in:].mean(axis=0)
fig = plt.figure(figsize=(8,4))
ax = plt.subplot2grid((2,2), (0,0))
ax.plot(thetas[burn_in:, 0], lw=1)
ax.axhline(y=posterior_mean[0], lw=1, linestyle="--", color="k")
ax.axhline(y=mean_vector[0], lw=1, linestyle="--", color="r")
ax.set_xlabel("theta 0")

ax = plt.subplot2grid((2,2), (1,0))
ax.plot(thetas[burn_in:, 1], lw=1)
ax.axhline(y=posterior_mean[1], lw=1, linestyle="--", color="k")
ax.axhline(y=mean_vector[1], lw=1, linestyle="--", color="r")
ax.set_xlabel("theta 1")

ax = plt.subplot2grid((2,2), (0,1), rowspan=2)

ax.plot(thetas[:, 0], thetas[:, 1], lw=1, linestyle="--", marker="x", ms=5)
ax.plot(posterior_mean[0], posterior_mean[1], marker="o", color="k", ms=10)
ax.axvline(x=mean_vector[0], linestyle="--", color="r", alpha=0.5)
ax.axhline(y=mean_vector[1], linestyle="--", color="r", alpha=0.5)
ax.set_xlabel("theta 0")
ax.set_ylabel("theta 1")

In [21]:
thetas.mean(axis=0), mean_vector
(array([ 0.45024316, -0.18936674]), array([ 0.5, -0.2]))
In [22]:
fig = plt.figure(figsize=(8,4))
posterior_mean = thetas[500:].mean(axis=0)
posterior_mean_rv = stats.multivariate_normal(posterior_mean, true_rv_2d.cov)

ax = plt.subplot2grid((2,2), (0,0))
ax.hist(thetas[500:, 0], alpha=0.5)
ax.axvline(x=posterior_mean[0], linestyle="--", color="b", alpha=0.5)
ax.axvline(x=mean_vector[0], linestyle="--", color="r", alpha=0.5)
ax.set_xlabel("theta 0")

ax = plt.subplot2grid((2,2), (1,0))
ax.hist(thetas[500:, 1], alpha=0.5)
ax.axvline(x=posterior_mean[1], linestyle="--", color="b", alpha=0.5)
ax.axvline(x=mean_vector[1], linestyle="--", color="r", alpha=0.5)
ax.set_xlabel("theta 1")

ax = plt.subplot2grid((2,2), (0,1), rowspan=2)

x_range = [samples_2d[:, 0].min()-1*step, samples_2d[:, 0].max()+1*step]
y_range = [samples_2d[:, 1].min()-1*step, samples_2d[:, 1].max()+1*step]
x, y, pos = get_2d_dist_vals(x_range, y_range)
ax.contourf(x, y, true_rv_2d.pdf(pos), alpha=0.3, cmap="Reds", label="True")
ax.axvline(x=mean_vector[0], linestyle="--", color="r", alpha=0.5, label="True")
ax.axhline(y=mean_vector[1], linestyle="--", color="r", alpha=0.5)

ax.contourf(x, y, posterior_mean_rv.pdf(pos), alpha=0.3, cmap="Blues", label="Posterior mean")
ax.axvline(x=posterior_mean[0], linestyle="--", color="b", alpha=0.5, label="Posterior mean")
ax.axhline(y=posterior_mean[1], linestyle="--", color="b", alpha=0.5)

ax.set_title("Data dist")
ax.set_xlabel("theta 0")
ax.set_ylabel("theta 1")


Gibbs sampling

The Gibbs sampling is an extension of the MH to allow for conditional exploration along each parameter, keeping all the other parameters constant.

In [ ]: