Goal: Sample from $p$, or approximate $E[f(x)]$, where the distribution $p$ is extremely complicated and very high-dimensional.
Problem 1: $p$ is too complicated for analytical analysis to be feasible
Problem 2: So we want to sample from $p$ instead, but it is also too complicated to sample from directly
Problem 3: We can’t come up with a good proposal distribution because of the high dimension
Intuition: In a high-dimensional space, the region carrying most of the probability mass can be very small. You want to sample from this small region, and random search is clearly not feasible.
The idea is to start from somewhere and move around, trying to move toward high probability. Once you get there, you try to stay in the high-probability region, doing a kind of random walk: moving around and exploring the space while staying where the probability is high. You do this by forming a Markov chain.
If you take independent and identically distributed samples $x$ from an unknown high-dimensional distribution $p(x)$, then as the number of samples grows, the empirical distribution converges to the true distribution.
If we use methods that know nothing about the probability (such as sampling on a uniform grid and using splines), then we have to treat all areas of the space as equally likely, which wastes a lot of computational resources.
In addition to using the samples to approximate the expectation, we can also find a maximum, that is, the most likely outcome.
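To make this concrete, here is a minimal sketch (my own example, not from the notes: the Gaussian stand-in for $p$ and the choice of $f$ are assumptions) of using i.i.d. samples to approximate $E[f(x)]$ and to locate the most likely outcome:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for samples from some complicated p(x); a Gaussian keeps the
# example self-contained and lets us check the answer analytically.
samples = rng.normal(loc=2.0, scale=1.0, size=100_000)

f = lambda x: x ** 2

# Monte Carlo estimate of E[f(x)]: the sample mean of f over the draws.
print(np.mean(f(samples)))  # ~5.0, since E[x^2] = mu^2 + sigma^2 = 5

# Most likely outcome: the sample with the highest (unnormalized)
# density, assuming we can at least evaluate p pointwise.
log_p = lambda x: -0.5 * (x - 2.0) ** 2
print(samples[np.argmax(log_p(samples))])  # close to the mode, 2.0
```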
In probabilistic terms, a chain is a sequence of possible states, where the probability of being in state $s$ at time $t$ is a function of the previous states. A Markov chain is a chain with the Markov property, i.e., the probability at time $t$ depends only on the state at $t - 1$. The possible states are linked together by transition probabilities that say how likely you are to move from the current state to each of the others; these are generally written as a matrix $T$.
A regular Markov chain converges to a unique stationary distribution regardless of the start state.
Sufficient condition for regularity: every pair of states is connected by a path of positive-probability transitions, and every state has a positive self-transition probability.
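A quick numerical illustration (the $3 \times 3$ matrix is an arbitrary example of mine satisfying the condition above): the rows of $T^k$ all converge to the same vector, so the start state stops mattering.

```python
import numpy as np

# A small regular chain: every entry positive, so every state reaches
# every other and each state has a positive self-transition.
T = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

Tk = np.linalg.matrix_power(T, 50)
print(Tk)          # every row is (approximately) the same vector pi

pi = Tk[0]
print(pi @ T)      # pi T = pi: the limiting vector is stationary
```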
Using a Markov chain
Goal: compute $P(x \in S)$ for some distribution $P$ that is hard to sample from directly
Construct a Markov chain $T$ whose unique stationary distribution is $P$
Sample $x^0$ from some initial distribution $P^{(0)}$
For $t = 0, 1, 2, \dots$, generate $x^{t+1}$ from $T(x^t \rightarrow x)$
Because the chain converges, the samples we draw eventually come from $P$.
We only want to use samples drawn from a distribution close to $P$. At early iterations, $P^{(t)}$ is usually far from $P$, so we start collecting samples only after the chain has run long enough to “mix” (see the sketch of this loop below).
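A minimal skeleton of the loop above (my own sketch; `sample_transition` is a hypothetical placeholder for drawing $x^{t+1}$ from $T(x^t \rightarrow x)$):

```python
import numpy as np

def run_chain(sample_transition, x0, n_samples, burn_in):
    """Run a Markov chain, discard the first `burn_in` states, and
    collect the rest as (approximate) samples from the stationary
    distribution."""
    x = x0
    kept = []
    for t in range(burn_in + n_samples):
        x = sample_transition(x)   # draw x^{t+1} from T(x^t -> x)
        if t >= burn_in:           # only collect once the chain has (hopefully) mixed
            kept.append(x)
    return np.array(kept)
```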
Mixing
How do you know if a chain has mixed? You don’t. In practice you can only gather evidence that it has not mixed, for example by comparing statistics across multiple chains started from different states.
Using the samples
Once the chain has mixed, all subsequent samples $x^t$ are from the stationary distribution $\pi$ (though consecutive samples are correlated, not independent).
Where does the Markov chain come from? How do we design a Markov chain with a desired stationary distribution?
Reversible Chains
The key idea behind Metropolis-Hastings is the notion of a reversible chain.
Imagine we have a chain with a particular stationary distribution $\pi$. We pick a state according to the stationary distribution $\pi$ (the red one) and then pick a random edge out of that state according to the transition model $T$ defined by the chain. We do the experiment again, and this time we pick the green one.
A chain is reversible if the probability of traversing the red edge is the same as the probability of traversing the green edge. This is the detailed balance condition:
$\pi(x)T(x \rightarrow x') = \pi(x')T(x' \rightarrow x)$
Theorem: If detailed balance holds and $T$ is regular, then $T$ has a unique stationary distribution $\pi$.
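A quick numerical sanity check (the distribution $\pi$ and the matrix $T$ are hand-built so that detailed balance holds; they are my example, not from the notes):

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])

# Transition matrix constructed so that pi(x) T(x -> x') = pi(x') T(x' -> x).
T = np.array([[0.50, 0.30, 0.20],
              [0.20, 0.30, 0.50],
              [0.08, 0.30, 0.62]])

# Detailed balance: the "probability flow" matrix pi(x) T(x -> x')
# must be symmetric.
flow = pi[:, None] * T
assert np.allclose(flow, flow.T)

# ...and detailed balance implies stationarity: pi T = pi.
assert np.allclose(pi @ T, pi)
```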
Metropolis-Hastings Chain
It starts out by saying: we want to move around broadly in the state space, so we are going to have a proposal distribution $Q$, which looks like a transition model. $Q$ is going to roam freely around the state space; unlike Gibbs sampling, it can explore a wide range of the state space.
We are also going to have a critic. The critic says: you cannot move to that state, because doing so would not give you the right stationary distribution.
The critic listens to the proposals made by the proposal distribution $Q$ and accepts or rejects them.
Proposal distribution: $Q(x \rightarrow x')$
Acceptance probability: $A(x \rightarrow x')$
Together, these give us a transition model $T$:
$T(x \rightarrow x') = Q(x \rightarrow x')A(x \rightarrow x')$ if $x' \neq x$
$T(x \rightarrow x) = Q(x \rightarrow x) + \displaystyle\sum_{x' \neq x}Q(x \rightarrow x')(1 - A(x \rightarrow x'))$
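Note that whatever acceptance probabilities we choose, this construction yields a valid transition model: rejected proposals just add mass to staying put. A small sketch checking this (uniform proposal and random $A$, both my own example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

Q = np.full((n, n), 1.0 / n)   # proposal: uniform over the n states
A = rng.uniform(size=(n, n))   # arbitrary acceptance probabilities in [0, 1]

T = Q * A                      # propose, then accept: covers x' != x
# Stay-put mass: Q(x -> x) plus every rejected proposal out of x.
rejected = (Q * (1.0 - A)).sum(axis=1) - np.diag(Q * (1.0 - A))
np.fill_diagonal(T, np.diag(Q) + rejected)

assert np.allclose(T.sum(axis=1), 1.0)   # every row is a distribution
```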
We are going to use detailed balance to construct the acceptance probability that we want.
$\pi(x)T(x \rightarrow x') = \pi(x')T(x' \rightarrow x)$
The goal is to construct $A$ so that detailed balance holds for the chain built from $Q$ and $A$. Substituting the transition model above (for $x' \neq x$):
$\pi(x)Q(x \rightarrow x')A(x \rightarrow x') = \pi(x')Q(x' \rightarrow x)A(x' \rightarrow x)$
$\frac{A(x \rightarrow x')}{A(x' \rightarrow x)} = \frac{\pi(x')Q(x' \rightarrow x)}{\pi(x)Q(x \rightarrow x')}$
Only the ratio of the two acceptance probabilities is constrained, so to accept as often as possible, when the right-hand side is some $\rho \leq 1$ we pick $A(x \rightarrow x') = \rho$ and $A(x' \rightarrow x) = 1$. In general:
$A(x \rightarrow x') = \min\left[1, \frac{\pi(x')Q(x' \rightarrow x)}{\pi(x)Q(x \rightarrow x')}\right]$
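Putting it all together, here is a minimal random-walk Metropolis-Hastings sketch (my own example, not from the notes). The Gaussian proposal is symmetric, $Q(x \rightarrow x') = Q(x' \rightarrow x)$, so the $Q$ terms cancel and the acceptance probability reduces to $\min[1, \pi(x')/\pi(x)]$. Note that $\pi$ only needs to be known up to a normalizing constant, which is exactly what makes this usable when $p$ is too complicated to normalize:

```python
import numpy as np

def metropolis_hastings(log_pi, x0, n_steps, step_size=0.5, rng=None):
    """Random-walk Metropolis-Hastings. `log_pi` is the log of the
    target density, known only up to an additive constant (the constant
    cancels in the acceptance ratio). Working in log space avoids
    underflow for tiny densities."""
    rng = rng or np.random.default_rng()
    x, log_px = x0, log_pi(x0)
    samples = []
    for _ in range(n_steps):
        x_new = x + rng.normal(scale=step_size)   # propose from symmetric Q
        log_px_new = log_pi(x_new)
        # Accept with probability min(1, pi(x')/pi(x)).
        if np.log(rng.uniform()) < log_px_new - log_px:
            x, log_px = x_new, log_px_new
        samples.append(x)                         # on rejection, stay at x
    return np.array(samples)

# Example: sample a standard normal from its unnormalized log-density,
# starting far from the mode to make the burn-in visible.
chain = metropolis_hastings(lambda x: -0.5 * x ** 2, x0=10.0, n_steps=50_000)
print(chain[5_000:].mean(), chain[5_000:].std())  # ~0 and ~1 after burn-in
```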
Choice of Q