Goal: Sample from $p$, or approximate $E[f(x)]$, where the distribution $p$ is extremely complicated and very high-dimensional.
Problem 1: $p$ is too complicated for analytical analysis to be feasible
Problem 2: So we want to sample from $p$ instead, but it is also too complicated to sample from directly
Problem 3: We can’t come up with a good proposal distribution because of the high dimension
Intuition: In a high-dimensional space, the region carrying most of the probability mass can be very small. You want to sample from this small region, and random search is clearly not feasible.
The idea is to start from somewhere and move around, trying to move toward high probability. Once you get there, you try to stay in the high-probability region, doing a kind of random walk: moving around and exploring the space while staying where the probability is high. You do this by forming a Markov chain.
If you take independent and identically distributed samples $x$ from an unknown high-dimensional distribution $p(x)$, then as the number of samples grows, the empirical distribution converges to the true distribution.
If we use methods that know nothing about the probability (such as sampling on a uniform grid and using splines), then we have to treat all areas of the space as equally likely, which wastes a lot of computational resources.
In addition to using the samples to approximate the expectation, we can also find a maximum, that is, the most likely outcome.
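To make this concrete, here is a minimal sketch (my own example, not from the notes: the Gaussian stand-in for $p$ and the choice of $f$ are assumptions) of using i.i.d. samples to approximate $E[f(x)]$ and to locate the most likely outcome:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for samples from some complicated p(x); a Gaussian keeps the
# example self-contained and lets us check the answer analytically.
samples = rng.normal(loc=2.0, scale=1.0, size=100_000)

f = lambda x: x ** 2

# Monte Carlo estimate of E[f(x)]: the sample mean of f over the draws.
print(np.mean(f(samples)))  # ~5.0, since E[x^2] = mu^2 + sigma^2 = 5

# Most likely outcome: the sample with the highest (unnormalized)
# density, assuming we can at least evaluate p pointwise.
log_p = lambda x: -0.5 * (x - 2.0) ** 2
print(samples[np.argmax(log_p(samples))])  # close to the mode, 2.0
```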
In probabilistic terms, a chain is a sequence of possible states, where the probability of being in state $s$ at time $t$ is a function of the previous states. A Markov chain is a chain with the Markov property, i.e., the probability at time $t$ depends only on the state at $t - 1$. The possible states are linked together by transition probabilities that say how likely you are to move from the current state to each of the others; these are generally written as a matrix $T$.
A regular Markov chain converges to a unique stationary distribution regardless of the start state.
Sufficient condition for regularity: every pair of states is connected by a path of positive-probability transitions, and every state has a positive self-transition probability.
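A quick numerical illustration (the $3 \times 3$ matrix is an arbitrary example of mine satisfying the condition above): the rows of $T^k$ all converge to the same vector, so the start state stops mattering.

```python
import numpy as np

# A small regular chain: every entry positive, so every state reaches
# every other and each state has a positive self-transition.
T = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

Tk = np.linalg.matrix_power(T, 50)
print(Tk)          # every row is (approximately) the same vector pi

pi = Tk[0]
print(pi @ T)      # pi T = pi: the limiting vector is stationary
```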
Using a Markov chain
Goal: compute $P(x \in S)$ for some distribution $P$ that is hard to sample from directly
Construct a Markov chain $T$ whose unique stationary distribution is $P$
Sample $x^0$ from some initial distribution $P^{(0)}$
For $t = 0, 1, 2, \dots$, generate $x^{t+1}$ from $T(x^t \rightarrow x)$
Because the chain converges, the samples we draw eventually come from $P$.
We only want to use samples drawn from a distribution close to $P$. At early iterations, $P^{(t)}$ is usually far from $P$, so we start collecting samples only after the chain has run long enough to “mix” (see the sketch of this loop below).
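A minimal skeleton of the loop above (my own sketch; `sample_transition` is a hypothetical placeholder for drawing $x^{t+1}$ from $T(x^t \rightarrow x)$):

```python
import numpy as np

def run_chain(sample_transition, x0, n_samples, burn_in):
    """Run a Markov chain, discard the first `burn_in` states, and
    collect the rest as (approximate) samples from the stationary
    distribution."""
    x = x0
    kept = []
    for t in range(burn_in + n_samples):
        x = sample_transition(x)   # draw x^{t+1} from T(x^t -> x)
        if t >= burn_in:           # only collect once the chain has (hopefully) mixed
            kept.append(x)
    return np.array(kept)
```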
Mixing
How do you know if a chain has mixed? You don’t. In practice you can only gather evidence that it has not mixed, for example by comparing statistics across multiple chains started from different states.
Using the samples
Once the chain has mixed, all subsequent samples $x^t$ are from the stationary distribution $\pi$ (though consecutive samples are correlated, not independent).
Where does the Markov chain come from? How do we design a Markov chain with a desired stationary distribution?
Reversible Chains
The key idea behind Metropolis-Hastings is the notion of a reversible chain.
Imagine we have a chain with a particular stationary distribution $\pi$. We pick a state according to the stationary distribution $\pi$ (the red one) and then pick a random edge out of that state according to the transition model $T$ defined by the chain. We do the experiment again, and this time we pick the green one.
A chain is reversible if the probability of traversing the red edge is the same as the probability of traversing the green edge. This is the detailed balance condition:
$\pi(x)T(x \rightarrow x') = \pi(x')T(x' \rightarrow x)$
Theorem: If detailed balance holds and $T$ is regular, then $T$ has a unique stationary distribution $\pi$.
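A quick numerical sanity check (the distribution $\pi$ and the matrix $T$ are hand-built so that detailed balance holds; they are my example, not from the notes):

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])

# Transition matrix constructed so that pi(x) T(x -> x') = pi(x') T(x' -> x).
T = np.array([[0.50, 0.30, 0.20],
              [0.20, 0.30, 0.50],
              [0.08, 0.30, 0.62]])

# Detailed balance: the "probability flow" matrix pi(x) T(x -> x')
# must be symmetric.
flow = pi[:, None] * T
assert np.allclose(flow, flow.T)

# ...and detailed balance implies stationarity: pi T = pi.
assert np.allclose(pi @ T, pi)
```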
Metropolis-Hastings Chain
It starts out by saying: we want to move around broadly in the state space, so we are going to have a proposal distribution $Q$, which looks like a transition model. $Q$ is going to roam freely around the state space; unlike Gibbs sampling, it can explore a wide range of the state space.
We are also going to have a critic. The critic says: you cannot move to that state, because doing so would not give you the right stationary distribution.
The critic listens to the proposals made by the proposal distribution $Q$ and accepts or rejects them.
Proposal distribution: $Q(x \rightarrow x')$
Acceptance probability: $A(x \rightarrow x')$
Together, these give us a transition model $T$:
$T(x \rightarrow x') = Q(x \rightarrow x')A(x \rightarrow x')$ if $x' \neq x$
$T(x \rightarrow x) = Q(x \rightarrow x) + \displaystyle\sum_{x' \neq x}Q(x \rightarrow x')(1 - A(x \rightarrow x'))$
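Note that whatever acceptance probabilities we choose, this construction yields a valid transition model: rejected proposals just add mass to staying put. A small sketch checking this (uniform proposal and random $A$, both my own example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

Q = np.full((n, n), 1.0 / n)   # proposal: uniform over the n states
A = rng.uniform(size=(n, n))   # arbitrary acceptance probabilities in [0, 1]

T = Q * A                      # propose, then accept: covers x' != x
# Stay-put mass: Q(x -> x) plus every rejected proposal out of x.
rejected = (Q * (1.0 - A)).sum(axis=1) - np.diag(Q * (1.0 - A))
np.fill_diagonal(T, np.diag(Q) + rejected)

assert np.allclose(T.sum(axis=1), 1.0)   # every row is a distribution
```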
We are going to use detailed balance to construct the acceptance probability that we want.
$\pi(x)T(x \rightarrow x') = \pi(x')T(x' \rightarrow x)$
The goal is to construct $A$ so that detailed balance holds for the chain built from $Q$ and $A$. Substituting the transition model above (for $x' \neq x$):
$\pi(x)Q(x \rightarrow x')A(x \rightarrow x') = \pi(x')Q(x' \rightarrow x)A(x' \rightarrow x)$
$\frac{A(x \rightarrow x')}{A(x' \rightarrow x)} = \frac{\pi(x')Q(x' \rightarrow x)}{\pi(x)Q(x \rightarrow x')}$
Only the ratio of the two acceptance probabilities is constrained, so to accept as often as possible, when the right-hand side is some $\rho \leq 1$ we pick $A(x \rightarrow x') = \rho$ and $A(x' \rightarrow x) = 1$. In general:
$A(x \rightarrow x') = \min\left[1, \frac{\pi(x')Q(x' \rightarrow x)}{\pi(x)Q(x \rightarrow x')}\right]$
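Putting it all together, here is a minimal random-walk Metropolis-Hastings sketch (my own example, not from the notes). The Gaussian proposal is symmetric, $Q(x \rightarrow x') = Q(x' \rightarrow x)$, so the $Q$ terms cancel and the acceptance probability reduces to $\min[1, \pi(x')/\pi(x)]$. Note that $\pi$ only needs to be known up to a normalizing constant, which is exactly what makes this usable when $p$ is too complicated to normalize:

```python
import numpy as np

def metropolis_hastings(log_pi, x0, n_steps, step_size=0.5, rng=None):
    """Random-walk Metropolis-Hastings. `log_pi` is the log of the
    target density, known only up to an additive constant (the constant
    cancels in the acceptance ratio). Working in log space avoids
    underflow for tiny densities."""
    rng = rng or np.random.default_rng()
    x, log_px = x0, log_pi(x0)
    samples = []
    for _ in range(n_steps):
        x_new = x + rng.normal(scale=step_size)   # propose from symmetric Q
        log_px_new = log_pi(x_new)
        # Accept with probability min(1, pi(x')/pi(x)).
        if np.log(rng.uniform()) < log_px_new - log_px:
            x, log_px = x_new, log_px_new
        samples.append(x)                         # on rejection, stay at x
    return np.array(samples)

# Example: sample a standard normal from its unnormalized log-density,
# starting far from the mode to make the burn-in visible.
chain = metropolis_hastings(lambda x: -0.5 * x ** 2, x0=10.0, n_steps=50_000)
print(chain[5_000:].mean(), chain[5_000:].std())  # ~0 and ~1 after burn-in
```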
Choice of Q