Random Variables
Random variable \( X : \Omega \to \mathbb{R} \)
Discrete random variable takes countably many values.
- Probability mass function (pmf)
- Examples: \(\text{Bernoulli}(p)\), \(\text{binomial}(n,p)\), \(\text{Poisson}(\lambda)\)
Continuous random variable takes all values on one or more intervals.
- Probability density function (pdf)
- Examples: \(\text{uniform}(a,b)\), \(\text{normal}(\mu, \sigma^2)\)
Joint distribution of random variables \(X\) and \(Y\):
- Joint pmf/pdf: \( f_{X,Y}(x, y) \)
Bayes Rule
\[ f_{X|Y}(x \mid y) \;:=\; \frac{f_{X,Y}(x,y)}{f_Y(y)} \] so long as \( f_Y(y) > 0 \).
Bayesian Inference
Suppose data \( x_1, \ldots, x_n \) are observations of the independent and identically distributed random variables \( X_1, \ldots, X_n \) from some probability distribution with pmf/pdf from the parametric family \(\{ f(x; \theta) : \theta \in \Theta \}\).
If we happen to have prior information about the true \(\theta^*\) associated with the data, in the form of some prior pmf/pdf \(\Pi(\theta)\), then we can define the posterior distribution
\[ \Pi(\theta \mid x_1, \ldots, x_n) \;:=\; \frac{\Pi(\theta) \prod_{i=1}^{n} f(x_i; \theta)}{\int_\Theta \Pi(\tilde{\theta}) \prod_{i=1}^{n} f(x_i; \tilde{\theta}) \, d\tilde{\theta}}. \]
In short-hand notation:
\[ \Pi(\theta \mid x) \;\propto\; \Pi(\theta) \prod_{i=1}^{n} f(x_i; \theta), \qquad \text{i.e.,} \qquad \text{posterior} \;\propto\; \text{prior} \times \text{likelihood}. \]
Example: Bernoulli Likelihood with Beta Prior
Suppose \( X_1,\ldots,X_n \overset{\text{iid}}{\sim} \text{Bernoulli}(p) \) and the prior on \(p\) is \(\text{beta}(a,b)\). Determine the posterior distribution of \(p\).
The likelihood is \(\prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{n - \sum x_i}\), and the prior density is proportional to \(p^{a-1}(1-p)^{b-1}\). And so
\[ \Pi(p \mid x) \;\propto\; p^{\sum x_i}(1-p)^{n-\sum x_i} \cdot p^{a-1}(1-p)^{b-1} \;=\; p^{a + \sum x_i - 1}(1-p)^{b + n - \sum x_i - 1}, \]
which is proportional to the pdf of the \(\text{beta}\!\left(a+\textstyle\sum x_i,\; b+n-\textstyle\sum x_i\right)\) distribution.
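As a numerical companion, a minimal sketch of this conjugate update; the simulated data and the hyperparameter choices \(a = 2\), \(b = 3\) are illustrative, not from the notes:

```python
# Numerical check of the Bernoulli-likelihood / beta-prior conjugate update.
# Hyperparameters a, b and the simulated data are illustrative choices.
import random

random.seed(1)
a, b = 2.0, 3.0                     # beta(a, b) prior hyperparameters
p_true = 0.7
x = [1 if random.random() < p_true else 0 for _ in range(50)]

s = sum(x)                          # number of successes
n = len(x)
a_post, b_post = a + s, b + n - s   # beta(a + sum x_i, b + n - sum x_i)

post_mean = a_post / (a_post + b_post)  # posterior mean of p
print(a_post, b_post, round(post_mean, 3))
```

With more data, the posterior mean is pulled from the prior mean \(a/(a+b)\) toward the sample proportion \(\sum x_i / n\).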
When the posterior distribution belongs to the same family as the prior, the prior is called "conjugate" (to the likelihood).
Example: Non-Conjugate Prior
How about an example of a non-conjugate prior?
Suppose \( X_1,\ldots,X_n \overset{\text{iid}}{\sim} N(\mu,1) \) and the prior \(\mu \sim \text{uniform}(a,b)\). Then
\[ \Pi(\mu \mid x) \;\propto\; \exp\!\left( -\tfrac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right) \mathbf{1}\{\mu \in (a,b)\} \;\propto\; \exp\!\left( -\tfrac{n}{2} (\mu - \bar{x})^2 \right) \mathbf{1}\{\mu \in (a,b)\}. \]
And the normalizing constant is
\[ \int_a^b \exp\!\left( -\tfrac{n}{2} (\mu - \bar{x})^2 \right) d\mu \;=\; \sqrt{\tfrac{2\pi}{n}} \left[ \Phi\!\left( \sqrt{n}\,(b - \bar{x}) \right) - \Phi\!\left( \sqrt{n}\,(a - \bar{x}) \right) \right], \]
where \(\Phi(\cdot)\) is the cumulative distribution function of the standard Gaussian distribution. Then
\[ \Pi(\mu \mid x) \;=\; \frac{\sqrt{\tfrac{n}{2\pi}}\, \exp\!\left( -\tfrac{n}{2} (\mu - \bar{x})^2 \right)}{\Phi\!\left( \sqrt{n}\,(b - \bar{x}) \right) - \Phi\!\left( \sqrt{n}\,(a - \bar{x}) \right)} \, \mathbf{1}\{\mu \in (a,b)\}, \]
the pdf of a \(N\!\left(\bar{x}, \tfrac{1}{n}\right)\) distribution truncated to \((a,b)\), which is not in the uniform family: the prior is not conjugate.
Notice that in the previous example we could even have taken \((a,b)\) to be all of \(\mathbb{R}\). This amounts to what is often referred to as a "flat" prior \(\Pi(\mu) \propto 1\). In fact, this is not a proper prior, because it is not integrable over \(\mathbb{R}\).
Example (Continued): Flat Prior
Nevertheless, the resulting posterior is proper (i.e., it integrates to 1).
Suppose \( X_1,\ldots,X_n \overset{\text{iid}}{\sim} N(\mu,1) \) and the prior \(\Pi(\mu) \propto 1\). Repeating the steps of the previous example, but without the indicator \(\mathbf{1}\{\mu\in(a,b)\}\), demonstrates that
\[ \Pi(\mu \mid x) \;=\; \sqrt{\tfrac{n}{2\pi}}\, \exp\!\left( -\tfrac{n}{2}(\mu - \bar{x})^2 \right), \]
which is the pdf of the \(N\!\left(\bar{x},\,\tfrac{1}{n}\right)\) distribution. This happens to coincide precisely with the sampling density of \(\bar{X} \sim N(\mu, \tfrac{1}{n})\), when viewed as a function of \(\bar{x}\).
Random Number and Variable Generation
We will assume we have software for generating random instances from the \(\text{Uniform}(0,1)\) distribution. How can we sample values from an arbitrary continuous distribution, given only samples from the \(\text{Uniform}(0,1)\) distribution?
Probability Integral Transform
Let \(X\) be a continuous random variable with cdf \(F_X\). Then
\[ F_X(X) \;\sim\; \text{Uniform}(0,1). \]
Why is this true? For \(y \in (0,1)\), assuming for simplicity that \(F_X\) is strictly increasing so that \(F_X^{-1}\) exists,
\[ P\big( F_X(X) \le y \big) \;=\; P\big( X \le F_X^{-1}(y) \big) \;=\; F_X\big( F_X^{-1}(y) \big) \;=\; y. \]
Recall that \(F(y) = y\) on \((0,1)\) is the cdf of the \(\text{Uniform}(0,1)\) distribution.
Accordingly, if we generate \(U \sim \text{Uniform}(0,1)\), then
\[ X \;:=\; F_X^{-1}(U) \quad \text{has cdf} \quad F_X. \]
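The transform above can be sketched with the exponential distribution, whose cdf \(F(x) = 1 - e^{-\lambda x}\) has the closed-form inverse \(F^{-1}(u) = -\log(1-u)/\lambda\); the rate \(\lambda = 2\) is an illustrative choice:

```python
# Inverse-transform sampling sketch: draw from Exponential(rate) using only
# Uniform(0,1) draws, via X = F^{-1}(U) = -log(1 - U) / rate.
import math
import random

random.seed(0)
rate = 2.0
n = 100_000
samples = [-math.log(1.0 - random.random()) / rate for _ in range(n)]

mean = sum(samples) / n   # should be close to the true mean 1 / rate = 0.5
print(round(mean, 3))
```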
Metropolis–Hastings (MH) Algorithm
Suppose \(\Pi(\theta \mid x)\) is a posterior density of interest, but we don't know its normalizing constant. The MH algorithm allows us to sample from the posterior nonetheless, bypassing the need for the normalizing constant. Accordingly, write
\[ \Pi(\theta \mid x) \;=\; c \, h(\theta \mid x), \]
where we know \(h(\theta \mid x)\) but not \(c\). Assume that \(q(\theta)\) is some density with \(\text{supp}\{h(\theta \mid x)\} \subseteq \text{supp}\{q(\theta)\}\).
MH Algorithm (Independent Sampling)
Given \(\theta^{(t)}\):
Randomly sample \(\theta^* \sim q\)
Assign
\[ \theta^{(t+1)} \;=\; \begin{cases} \theta^* & \text{with probability } \alpha \\ \theta^{(t)} & \text{with probability } 1 - \alpha, \end{cases} \]
where
\[ \alpha \;=\; \min\!\left\{ 1, \; \frac{h(\theta^* \mid x)\, q(\theta^{(t)})}{h(\theta^{(t)} \mid x)\, q(\theta^*)} \right\}. \]
Based on this algorithm, values of \(\theta\) are accepted in direct proportion to the value of the density \(\Pi(\theta \mid x)\). The key observation is that
\[ \frac{h(\theta^* \mid x)}{h(\theta^{(t)} \mid x)} \;=\; \frac{c\, h(\theta^* \mid x)}{c\, h(\theta^{(t)} \mid x)} \;=\; \frac{\Pi(\theta^* \mid x)}{\Pi(\theta^{(t)} \mid x)}, \]
so the unknown normalizing constant \(c\) cancels.
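A minimal sketch of the independence sampler, targeting the unnormalized density \(h(\theta) = \theta^2(1-\theta)\) on \((0,1)\) (an illustrative \(\text{beta}(3,2)\) target, not from the notes) with a \(\text{Uniform}(0,1)\) proposal:

```python
# Independence MH sketch: sample from h(theta) = theta^2 (1 - theta) on (0,1),
# the unnormalized beta(3,2) density, using a Uniform(0,1) proposal.
# The normalizing constant c is never needed: only ratios of h appear.
import random

random.seed(0)

def h(theta):                      # unnormalized target, known only up to c
    return theta**2 * (1.0 - theta)

theta = 0.5                        # initial value
draws = []
for _ in range(20_000):
    theta_star = random.random()   # proposal q = Uniform(0,1); q cancels in the ratio
    alpha = min(1.0, h(theta_star) / h(theta))
    if random.random() < alpha:    # accept with probability alpha
        theta = theta_star
    draws.append(theta)

mean = sum(draws) / len(draws)     # beta(3,2) has mean 3/5 = 0.6
print(round(mean, 2))
```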
Markov Chain Monte Carlo (MCMC)
The MH algorithm is the canonical algorithm in a broader class of algorithms referred to as Markov chain Monte Carlo (MCMC) methods. MCMC is primarily what constitutes "Bayesian computations."
A crux of MCMC is to accept enough proposals that the algorithm converges quickly, while not so many that the accepted proposals are too dependent. We are trying to draw (approximately) independent samples from the posterior distribution, but we are doing so via the construction of a Markov chain. Generally, we aim to accept approximately 20–50% of proposals to achieve adequate "mixing" of the algorithm. An approach to gain greater control over the acceptance rate is to formulate the proposal distribution as a random walk. For example,
\[ \theta^* \mid \theta^{(t)} \;\sim\; N\!\left(\theta^{(t)}, \tau^2\right), \]
where \(\tau > 0\) determines the scale of the proposals, centered around \(\theta^{(t)}\).
MH Algorithm (Gaussian Random Walk)
Given \(\theta^{(t)}\):
Randomly sample \(\theta^* = \theta^{(t)} + Z\), where \(Z \sim N(0, \tau^2)\)
Assign
\[ \theta^{(t+1)} \;=\; \begin{cases} \theta^* & \text{with probability } \alpha \\ \theta^{(t)} & \text{with probability } 1 - \alpha, \end{cases} \]
where
\[ \alpha \;=\; \min\!\left\{ 1, \; \frac{h(\theta^* \mid x)}{h(\theta^{(t)} \mid x)} \right\}. \]
Note that due to the symmetry of the \(N(0, \tau^2)\) distribution, \(q(\theta^* \mid \theta^{(t)}) = q(\theta^{(t)} \mid \theta^*)\), leading to the cancellation of the proposal densities in the MH ratio.
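A minimal sketch of the random-walk version, targeting \(h(\theta) \propto e^{-\theta^2/2}\) (an illustrative standard normal target) and working on the log scale for numerical stability:

```python
# Gaussian random-walk MH sketch targeting h(theta) = exp(-theta^2 / 2),
# i.e., a standard normal with the normalizing constant ignored.
# tau controls the proposal scale and hence the acceptance rate.
import math
import random

random.seed(0)

def log_h(theta):                  # log of the unnormalized target
    return -0.5 * theta * theta

tau = 1.0
theta = 0.0
accepted = 0
draws = []
for _ in range(50_000):
    theta_star = theta + random.gauss(0.0, tau)   # symmetric proposal
    # symmetry => acceptance ratio reduces to h(theta*) / h(theta)
    if math.log(random.random()) < log_h(theta_star) - log_h(theta):
        theta = theta_star
        accepted += 1
    draws.append(theta)

rate = accepted / len(draws)
print(round(rate, 2), round(sum(draws) / len(draws), 2))
```

Shrinking \(\tau\) raises the acceptance rate but makes successive draws more dependent; enlarging it does the opposite.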
Assessing and Diagnosing Convergence of MCMC Algorithms
Convergence is most often assessed via visual inspection of the trace plot, but more rigorous diagnostic metrics have been proposed and studied.
Gelman–Rubin (GR) Statistic
Run \(J\) chains of an MCMC algorithm, each with a different initial value. Denote chain \(j \in \{1,\ldots,J\}\) by \(x_1^{(j)},\ldots,x_L^{(j)}\), post burn-in phase. Define the following:
Sample mean of chain \(j\):
\[ \bar{x}^{(j)} \;=\; \frac{1}{L} \sum_{l=1}^{L} x_l^{(j)} \]
Sample mean of the sample means of the \(J\) chains:
\[ \bar{x} \;=\; \frac{1}{J} \sum_{j=1}^{J} \bar{x}^{(j)} \]
Sample variance of the sample means of the \(J\) chains (scaled by \(L\)), the between-chain variance:
\[ B \;=\; \frac{L}{J-1} \sum_{j=1}^{J} \left( \bar{x}^{(j)} - \bar{x} \right)^2 \]
Sample mean of the sample variances of the \(J\) chains, the within-chain variance:
\[ W \;=\; \frac{1}{J} \sum_{j=1}^{J} s_j^2, \qquad \text{where} \qquad s_j^2 \;=\; \frac{1}{L-1} \sum_{l=1}^{L} \left( x_l^{(j)} - \bar{x}^{(j)} \right)^2 \]
The GR statistic:
\[ R \;=\; \sqrt{ \frac{ \tfrac{L-1}{L}\, W + \tfrac{1}{L}\, B }{ W } } \]
As \(L \to \infty\), \(B/L \to 0\) and \(R \to 1\), if the chains have converged. It has been argued in the literature that \(R \in [1,\,1.1]\) indicates convergence.
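The definitions above can be sketched directly; here the \(J\) "chains" are iid standard normal draws, an illustrative stand-in for real MCMC output, so all chains agree and \(R\) should be near 1:

```python
# Gelman-Rubin statistic sketch on J synthetic "chains" (iid normal draws
# standing in for post-burn-in MCMC output, so R should be close to 1).
import random

random.seed(0)
J, L = 4, 5_000
chains = [[random.gauss(0.0, 1.0) for _ in range(L)] for _ in range(J)]

chain_means = [sum(c) / L for c in chains]              # x-bar^(j)
grand_mean = sum(chain_means) / J                       # x-bar
B = L / (J - 1) * sum((m - grand_mean) ** 2 for m in chain_means)
W = sum(                                                # mean of s_j^2
    sum((x - m) ** 2 for x in c) / (L - 1)
    for c, m in zip(chains, chain_means)
) / J
R = (((L - 1) / L * W + B / L) / W) ** 0.5
print(round(R, 3))
```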
Gibbs Sampling
Suppose \(\Pi(\theta_1,\ldots,\theta_p \mid y)\) is a posterior density with unknown normalizing constant. If, however, we can derive the full conditional densities
\[ \Pi(\theta_j \mid \theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_p, y), \qquad j = 1, \ldots, p, \]
then we can sample from the posterior via the following algorithm.
Gibbs Sampler Algorithm
Given \(\theta^{(t)} = (\theta_1^{(t)},\ldots,\theta_p^{(t)})\):
Sample
\[ \begin{aligned} \theta_1^{(t+1)} &\sim \Pi\big(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_p^{(t)}, y\big) \\ \theta_2^{(t+1)} &\sim \Pi\big(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_p^{(t)}, y\big) \\ &\;\;\vdots \\ \theta_p^{(t+1)} &\sim \Pi\big(\theta_p \mid \theta_1^{(t+1)}, \ldots, \theta_{p-1}^{(t+1)}, y\big) \end{aligned} \]
To understand why this works, construct the MH ratio for updating \(\theta_1\), using the full conditional \(\Pi(\theta_1 \mid \theta_2,\ldots,\theta_p,y)\) as the proposal density:
\[ \alpha \;=\; \min\!\left\{ 1, \; \frac{\Pi(\theta_1^* \mid \theta_2,\ldots,\theta_p, y)\; \Pi(\theta_1^{(t)} \mid \theta_2,\ldots,\theta_p, y)}{\Pi(\theta_1^{(t)} \mid \theta_2,\ldots,\theta_p, y)\; \Pi(\theta_1^* \mid \theta_2,\ldots,\theta_p, y)} \right\} \;=\; 1, \]
so every proposal is accepted.
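A minimal sketch of a two-component Gibbs sampler for an illustrative bivariate normal target with correlation \(\rho\) (not from the notes), whose full conditionals are themselves normal: \(\theta_1 \mid \theta_2 \sim N(\rho\,\theta_2,\, 1-\rho^2)\), and vice versa:

```python
# Gibbs sampler sketch for a standard bivariate normal with correlation rho.
# Each full conditional is normal: theta_1 | theta_2 ~ N(rho*theta_2, 1-rho^2).
import random

random.seed(0)
rho = 0.8
sd = (1.0 - rho * rho) ** 0.5      # conditional standard deviation
t1, t2 = 0.0, 0.0                  # initial values
draws = []
for _ in range(50_000):
    t1 = random.gauss(rho * t2, sd)   # sample theta_1 | theta_2
    t2 = random.gauss(rho * t1, sd)   # sample theta_2 | theta_1 (updated t1)
    draws.append((t1, t2))

m1 = sum(d[0] for d in draws) / len(draws)              # should be near 0
corr_hat = sum(d[0] * d[1] for d in draws) / len(draws) # should be near rho
print(round(m1, 2), round(corr_hat, 2))
```

Note that each component draw conditions on the most recently updated values of the other components, exactly as in the algorithm above.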
Monte Carlo Integration
Let \(X \sim f_X\). Then for any \(f_X\)-integrable function \(h: \mathbb{R} \to \mathbb{R}\),
\[ E\big[h(X)\big] \;=\; \int h(x)\, f_X(x)\, dx \;\approx\; \frac{1}{n} \sum_{i=1}^{n} h(x_i) \]
for an iid observed sample \(x_1,\ldots,x_n\). Moreover, by the strong law of large numbers,
\[ \frac{1}{n} \sum_{i=1}^{n} h(x_i) \;\xrightarrow{\text{a.s.}}\; E\big[h(X)\big] \qquad \text{as } n \to \infty. \]
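A minimal sketch, with the illustrative choices \(X \sim \text{Uniform}(0,1)\) and \(h(x) = x^2\), so that \(E[h(X)] = \int_0^1 x^2\,dx = 1/3\):

```python
# Monte Carlo integration sketch: estimate E[h(X)] = 1/3 for X ~ Uniform(0,1)
# and h(x) = x^2 by the sample average of h over iid draws.
import random

random.seed(0)
n = 200_000
est = sum(random.random() ** 2 for _ in range(n)) / n
print(round(est, 3))
```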
Importance Sampling
Importance sampling is another strategy for numerical integration. Suppose it is of interest to compute an analytically intractable integral of some function \(g\). Then, for any density \(f_X\),
\[ \int g(x)\, dx \;=\; \int \frac{g(x)}{f_X(x)}\, f_X(x)\, dx \;=\; E\!\left[ \frac{g(X)}{f_X(X)} \right] \;\approx\; \frac{1}{n} \sum_{i=1}^{n} \frac{g(x_i)}{f_X(x_i)} \]
for an iid observed sample \(x_1,\ldots,x_n\) from \(f_X\), so long as \(\text{supp}(g) \subseteq \text{supp}(f_X)\).
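A minimal sketch with the illustrative choices \(g(x) = e^{-x^2}\) (so the true integral is \(\sqrt{\pi} \approx 1.7725\)) and \(f_X\) the standard normal density, whose support \(\mathbb{R}\) covers that of \(g\):

```python
# Importance sampling sketch: estimate the integral of g(x) = exp(-x^2)
# over R (true value sqrt(pi)) by averaging g(x_i) / f(x_i) over iid
# draws x_i from the standard normal density f.
import math
import random

random.seed(0)

def g(x):                           # integrand
    return math.exp(-x * x)

def f(x):                           # standard normal pdf (importance density)
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

n = 100_000
est = sum(g(x) / f(x) for x in (random.gauss(0.0, 1.0) for _ in range(n))) / n
print(round(est, 3))
```

The estimate is accurate here because the weight \(g(x)/f(x)\) is bounded; a poorly matched \(f_X\) with lighter tails than \(g\) would blow up the variance.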