Random Variables
Random variable \( X : \Omega \to \mathbb{R} \)
Discrete random variable takes countably many values.
- Probability mass function (pmf)
- Examples: \(\text{Bernoulli}(p)\), \(\text{binomial}(n,p)\), \(\text{Poisson}(\lambda)\)
Continuous random variable takes all values on one or more intervals.
- Probability density function (pdf)
- Examples: \(\text{uniform}(a,b)\), \(\text{normal}(\mu, \sigma^2)\)
Joint distribution of random variables \(X\) and \(Y\):
- Joint pmf/pdf: \( f_{X,Y}(x, y) \)
Bayes Rule
\[ f_{X|Y}(x \mid y) \;:=\; \frac{f_{X,Y}(x,y)}{f_Y(y)} \] so long as \( f_Y(y) > 0 \).
Bayesian Inference
Suppose data \( x_1, \ldots, x_n \) are observations of the independent and identically distributed random variables \( X_1, \ldots, X_n \) from some probability distribution with pmf/pdf from the parametric family \(\{ f(x; \theta) : \theta \in \Theta \}\).
If we happen to have prior information about the true \(\theta^*\) associated with the data, in the form of some prior pmf/pdf \(\Pi(\theta)\), then we can define the posterior distribution
\[ \Pi(\theta \mid x_1, \ldots, x_n) \;:=\; \frac{\Pi(\theta) \prod_{i=1}^{n} f(x_i; \theta)}{\int_\Theta \Pi(\tilde{\theta}) \prod_{i=1}^{n} f(x_i; \tilde{\theta}) \, d\tilde{\theta}}. \]
In short-hand notation:
\[ \Pi(\theta \mid x) \;\propto\; \Pi(\theta) \prod_{i=1}^{n} f(x_i; \theta), \qquad \text{i.e.,} \qquad \text{posterior} \;\propto\; \text{prior} \times \text{likelihood}. \]
Example: Bernoulli Likelihood with Beta Prior
Suppose \( X_1,\ldots,X_n \overset{\text{iid}}{\sim} \text{Bernoulli}(p) \) and the prior on \(p\) is \(\text{beta}(a,b)\). Determine the posterior distribution of \(p\).
The likelihood is \(\prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{n - \sum x_i}\), and the prior density is proportional to \(p^{a-1}(1-p)^{b-1}\). And so
\[ \Pi(p \mid x) \;\propto\; p^{\sum x_i}(1-p)^{n-\sum x_i} \cdot p^{a-1}(1-p)^{b-1} \;=\; p^{a + \sum x_i - 1}(1-p)^{b + n - \sum x_i - 1}, \]
which is proportional to the pdf of the \(\text{beta}\!\left(a+\textstyle\sum x_i,\; b+n-\textstyle\sum x_i\right)\) distribution.
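As a numerical companion, a minimal sketch of this conjugate update; the simulated data and the hyperparameter choices \(a = 2\), \(b = 3\) are illustrative, not from the notes:

```python
# Numerical check of the Bernoulli-likelihood / beta-prior conjugate update.
# Hyperparameters a, b and the simulated data are illustrative choices.
import random

random.seed(1)
a, b = 2.0, 3.0                     # beta(a, b) prior hyperparameters
p_true = 0.7
x = [1 if random.random() < p_true else 0 for _ in range(50)]

s = sum(x)                          # number of successes
n = len(x)
a_post, b_post = a + s, b + n - s   # beta(a + sum x_i, b + n - sum x_i)

post_mean = a_post / (a_post + b_post)  # posterior mean of p
print(a_post, b_post, round(post_mean, 3))
```

With more data, the posterior mean is pulled from the prior mean \(a/(a+b)\) toward the sample proportion \(\sum x_i / n\).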
When the posterior distribution belongs to the same family as the prior, the prior is called "conjugate" (to the likelihood).
Example: Non-Conjugate Prior
How about an example of a non-conjugate prior?
Suppose \( X_1,\ldots,X_n \overset{\text{iid}}{\sim} N(\mu,1) \) and the prior \(\mu \sim \text{uniform}(a,b)\). Then
\[ \Pi(\mu \mid x) \;\propto\; \exp\!\left( -\tfrac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right) \mathbf{1}\{\mu \in (a,b)\} \;\propto\; \exp\!\left( -\tfrac{n}{2} (\mu - \bar{x})^2 \right) \mathbf{1}\{\mu \in (a,b)\}. \]
And the normalizing constant is
\[ \int_a^b \exp\!\left( -\tfrac{n}{2} (\mu - \bar{x})^2 \right) d\mu \;=\; \sqrt{\tfrac{2\pi}{n}} \left[ \Phi\!\left( \sqrt{n}\,(b - \bar{x}) \right) - \Phi\!\left( \sqrt{n}\,(a - \bar{x}) \right) \right], \]
where \(\Phi(\cdot)\) is the cumulative distribution function of the standard Gaussian distribution. Then
\[ \Pi(\mu \mid x) \;=\; \frac{\sqrt{\tfrac{n}{2\pi}}\, \exp\!\left( -\tfrac{n}{2} (\mu - \bar{x})^2 \right)}{\Phi\!\left( \sqrt{n}\,(b - \bar{x}) \right) - \Phi\!\left( \sqrt{n}\,(a - \bar{x}) \right)} \, \mathbf{1}\{\mu \in (a,b)\}, \]
the pdf of a \(N\!\left(\bar{x}, \tfrac{1}{n}\right)\) distribution truncated to \((a,b)\), which is not in the uniform family: the prior is not conjugate.
Notice that in the previous example we could even have taken \((a,b)\) to be all of \(\mathbb{R}\). This amounts to what is often referred to as a "flat" prior \(\Pi(\mu) \propto 1\). In fact, this is not a proper prior, because it is not integrable over \(\mathbb{R}\).
Example (Continued): Flat Prior
Nevertheless, the resulting posterior is proper (i.e., it integrates to 1).
Suppose \( X_1,\ldots,X_n \overset{\text{iid}}{\sim} N(\mu,1) \) and the prior \(\Pi(\mu) \propto 1\). Repeating the steps of the previous example, but without the indicator \(\mathbf{1}\{\mu\in(a,b)\}\), demonstrates that
\[ \Pi(\mu \mid x) \;=\; \sqrt{\tfrac{n}{2\pi}}\, \exp\!\left( -\tfrac{n}{2}(\mu - \bar{x})^2 \right), \]
which is the pdf of the \(N\!\left(\bar{x},\,\tfrac{1}{n}\right)\) distribution. This happens to coincide precisely with the sampling density of \(\bar{X} \sim N(\mu, \tfrac{1}{n})\), when viewed as a function of \(\bar{x}\).
Random Number and Variable Generation
We will assume we have software for generating random instances from the \(\text{Uniform}(0,1)\) distribution. How can we sample values from an arbitrary continuous distribution, given only samples from the \(\text{Uniform}(0,1)\) distribution?
Probability Integral Transform
Let \(X\) be a continuous random variable with cdf \(F_X\). Then
\[ F_X(X) \;\sim\; \text{Uniform}(0,1). \]
Why is this true? For \(y \in (0,1)\), assuming for simplicity that \(F_X\) is strictly increasing so that \(F_X^{-1}\) exists,
\[ P\big( F_X(X) \le y \big) \;=\; P\big( X \le F_X^{-1}(y) \big) \;=\; F_X\big( F_X^{-1}(y) \big) \;=\; y. \]
Recall that \(F(y) = y\) on \((0,1)\) is the cdf of the \(\text{Uniform}(0,1)\) distribution.
Accordingly, if we generate \(U \sim \text{Uniform}(0,1)\), then
\[ X \;:=\; F_X^{-1}(U) \quad \text{has cdf} \quad F_X. \]
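The transform above can be sketched with the exponential distribution, whose cdf \(F(x) = 1 - e^{-\lambda x}\) has the closed-form inverse \(F^{-1}(u) = -\log(1-u)/\lambda\); the rate \(\lambda = 2\) is an illustrative choice:

```python
# Inverse-transform sampling sketch: draw from Exponential(rate) using only
# Uniform(0,1) draws, via X = F^{-1}(U) = -log(1 - U) / rate.
import math
import random

random.seed(0)
rate = 2.0
n = 100_000
samples = [-math.log(1.0 - random.random()) / rate for _ in range(n)]

mean = sum(samples) / n   # should be close to the true mean 1 / rate = 0.5
print(round(mean, 3))
```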
Metropolis–Hastings (MH) Algorithm
Suppose \(\Pi(\theta \mid x)\) is a posterior density of interest, but we don't know its normalizing constant. The MH algorithm allows us to sample from the posterior nonetheless, bypassing the need for the normalizing constant. Accordingly, write
\[ \Pi(\theta \mid x) \;=\; c \, h(\theta \mid x), \]
where we know \(h(\theta \mid x)\) but not \(c\). Assume that \(q(\theta)\) is some density with \(\text{supp}\{h(\theta \mid x)\} \subseteq \text{supp}\{q(\theta)\}\).
MH Algorithm (Independent Sampling)
Given \(\theta^{(t)}\):
Randomly sample \(\theta^* \sim q\)
Assign
\[ \theta^{(t+1)} \;=\; \begin{cases} \theta^* & \text{with probability } \alpha \\ \theta^{(t)} & \text{with probability } 1 - \alpha, \end{cases} \]
where
\[ \alpha \;=\; \min\!\left\{ 1, \; \frac{h(\theta^* \mid x)\, q(\theta^{(t)})}{h(\theta^{(t)} \mid x)\, q(\theta^*)} \right\}. \]
Based on this algorithm, values of \(\theta\) are accepted in direct proportion to the value of the density \(\Pi(\theta \mid x)\). The key observation is that
\[ \frac{h(\theta^* \mid x)}{h(\theta^{(t)} \mid x)} \;=\; \frac{c\, h(\theta^* \mid x)}{c\, h(\theta^{(t)} \mid x)} \;=\; \frac{\Pi(\theta^* \mid x)}{\Pi(\theta^{(t)} \mid x)}, \]
so the unknown normalizing constant \(c\) cancels.
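A minimal sketch of the independence sampler, targeting the unnormalized density \(h(\theta) = \theta^2(1-\theta)\) on \((0,1)\) (an illustrative \(\text{beta}(3,2)\) target, not from the notes) with a \(\text{Uniform}(0,1)\) proposal:

```python
# Independence MH sketch: sample from h(theta) = theta^2 (1 - theta) on (0,1),
# the unnormalized beta(3,2) density, using a Uniform(0,1) proposal.
# The normalizing constant c is never needed: only ratios of h appear.
import random

random.seed(0)

def h(theta):                      # unnormalized target, known only up to c
    return theta**2 * (1.0 - theta)

theta = 0.5                        # initial value
draws = []
for _ in range(20_000):
    theta_star = random.random()   # proposal q = Uniform(0,1); q cancels in the ratio
    alpha = min(1.0, h(theta_star) / h(theta))
    if random.random() < alpha:    # accept with probability alpha
        theta = theta_star
    draws.append(theta)

mean = sum(draws) / len(draws)     # beta(3,2) has mean 3/5 = 0.6
print(round(mean, 2))
```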
Markov Chain Monte Carlo (MCMC)
The MH algorithm is the canonical algorithm in a broader class of algorithms referred to as Markov chain Monte Carlo (MCMC) methods. MCMC is primarily what constitutes "Bayesian computations."
A crux of MCMC is to accept enough proposals that the algorithm converges quickly, while not so many that the accepted proposals are too dependent. We are trying to draw (approximately) independent samples from the posterior distribution, but we are doing so via the construction of a Markov chain. Generally, we aim to accept approximately 20–50% of proposals to achieve adequate "mixing" of the algorithm. An approach to gain greater control over the acceptance rate is to formulate the proposal distribution as a random walk. For example,
\[ \theta^* \mid \theta^{(t)} \;\sim\; N\!\left(\theta^{(t)}, \tau^2\right), \]
where \(\tau > 0\) determines the scale of the proposals, centered around \(\theta^{(t)}\).
MH Algorithm (Gaussian Random Walk)
Given \(\theta^{(t)}\):
Randomly sample \(\theta^* = \theta^{(t)} + Z\), where \(Z \sim N(0, \tau^2)\)
Assign
\[ \theta^{(t+1)} \;=\; \begin{cases} \theta^* & \text{with probability } \alpha \\ \theta^{(t)} & \text{with probability } 1 - \alpha, \end{cases} \]
where
\[ \alpha \;=\; \min\!\left\{ 1, \; \frac{h(\theta^* \mid x)}{h(\theta^{(t)} \mid x)} \right\}. \]
Note that due to the symmetry of the \(N(0, \tau^2)\) distribution, \(q(\theta^* \mid \theta^{(t)}) = q(\theta^{(t)} \mid \theta^*)\), leading to the cancellation of the proposal densities in the MH ratio.
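A minimal sketch of the random-walk version, targeting \(h(\theta) \propto e^{-\theta^2/2}\) (an illustrative standard normal target) and working on the log scale for numerical stability:

```python
# Gaussian random-walk MH sketch targeting h(theta) = exp(-theta^2 / 2),
# i.e., a standard normal with the normalizing constant ignored.
# tau controls the proposal scale and hence the acceptance rate.
import math
import random

random.seed(0)

def log_h(theta):                  # log of the unnormalized target
    return -0.5 * theta * theta

tau = 1.0
theta = 0.0
accepted = 0
draws = []
for _ in range(50_000):
    theta_star = theta + random.gauss(0.0, tau)   # symmetric proposal
    # symmetry => acceptance ratio reduces to h(theta*) / h(theta)
    if math.log(random.random()) < log_h(theta_star) - log_h(theta):
        theta = theta_star
        accepted += 1
    draws.append(theta)

rate = accepted / len(draws)
print(round(rate, 2), round(sum(draws) / len(draws), 2))
```

Shrinking \(\tau\) raises the acceptance rate but makes successive draws more dependent; enlarging it does the opposite.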
Assessing and Diagnosing Convergence of MCMC Algorithms
Convergence is most often assessed via visual inspection of the trace plot, but more rigorous diagnostic metrics have been proposed and studied.
Gelman–Rubin (GR) Statistic
Run \(J\) chains of an MCMC algorithm, each with a different initial value. Denote chain \(j \in \{1,\ldots,J\}\) by \(x_1^{(j)},\ldots,x_L^{(j)}\), post burn-in phase. Define the following:
Sample mean of chain \(j\):
\[ \bar{x}^{(j)} \;=\; \frac{1}{L} \sum_{l=1}^{L} x_l^{(j)} \]
Sample mean of the sample means of the \(J\) chains:
\[ \bar{x} \;=\; \frac{1}{J} \sum_{j=1}^{J} \bar{x}^{(j)} \]
Sample variance of the sample means of the \(J\) chains (scaled by \(L\)), the between-chain variance:
\[ B \;=\; \frac{L}{J-1} \sum_{j=1}^{J} \left( \bar{x}^{(j)} - \bar{x} \right)^2 \]
Sample mean of the sample variances of the \(J\) chains, the within-chain variance:
\[ W \;=\; \frac{1}{J} \sum_{j=1}^{J} s_j^2, \qquad \text{where} \qquad s_j^2 \;=\; \frac{1}{L-1} \sum_{l=1}^{L} \left( x_l^{(j)} - \bar{x}^{(j)} \right)^2 \]
The GR statistic:
\[ R \;=\; \sqrt{ \frac{ \tfrac{L-1}{L}\, W + \tfrac{1}{L}\, B }{ W } } \]
As \(L \to \infty\), \(B/L \to 0\) and \(R \to 1\), if the chains have converged. It has been argued in the literature that \(R \in [1,\,1.1]\) indicates convergence.
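The definitions above can be sketched directly; here the \(J\) "chains" are iid standard normal draws, an illustrative stand-in for real MCMC output, so all chains agree and \(R\) should be near 1:

```python
# Gelman-Rubin statistic sketch on J synthetic "chains" (iid normal draws
# standing in for post-burn-in MCMC output, so R should be close to 1).
import random

random.seed(0)
J, L = 4, 5_000
chains = [[random.gauss(0.0, 1.0) for _ in range(L)] for _ in range(J)]

chain_means = [sum(c) / L for c in chains]              # x-bar^(j)
grand_mean = sum(chain_means) / J                       # x-bar
B = L / (J - 1) * sum((m - grand_mean) ** 2 for m in chain_means)
W = sum(                                                # mean of s_j^2
    sum((x - m) ** 2 for x in c) / (L - 1)
    for c, m in zip(chains, chain_means)
) / J
R = (((L - 1) / L * W + B / L) / W) ** 0.5
print(round(R, 3))
```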
Gibbs Sampling
Suppose \(\Pi(\theta_1,\ldots,\theta_p \mid y)\) is a posterior density with unknown normalizing constant. If, however, we can derive the full conditional densities
\[ \Pi(\theta_j \mid \theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_p, y), \qquad j = 1, \ldots, p, \]
then we can sample from the posterior via the following algorithm.
Gibbs Sampler Algorithm
Given \(\theta^{(t)} = (\theta_1^{(t)},\ldots,\theta_p^{(t)})\):
Sample
\[ \begin{aligned} \theta_1^{(t+1)} &\sim \Pi\big(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_p^{(t)}, y\big) \\ \theta_2^{(t+1)} &\sim \Pi\big(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_p^{(t)}, y\big) \\ &\;\;\vdots \\ \theta_p^{(t+1)} &\sim \Pi\big(\theta_p \mid \theta_1^{(t+1)}, \ldots, \theta_{p-1}^{(t+1)}, y\big) \end{aligned} \]
To understand why this works, construct the MH ratio for updating \(\theta_1\), using the full conditional \(\Pi(\theta_1 \mid \theta_2,\ldots,\theta_p,y)\) as the proposal density:
\[ \alpha \;=\; \min\!\left\{ 1, \; \frac{\Pi(\theta_1^* \mid \theta_2,\ldots,\theta_p, y)\; \Pi(\theta_1^{(t)} \mid \theta_2,\ldots,\theta_p, y)}{\Pi(\theta_1^{(t)} \mid \theta_2,\ldots,\theta_p, y)\; \Pi(\theta_1^* \mid \theta_2,\ldots,\theta_p, y)} \right\} \;=\; 1, \]
so every proposal is accepted.
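A minimal sketch of a two-component Gibbs sampler for an illustrative bivariate normal target with correlation \(\rho\) (not from the notes), whose full conditionals are themselves normal: \(\theta_1 \mid \theta_2 \sim N(\rho\,\theta_2,\, 1-\rho^2)\), and vice versa:

```python
# Gibbs sampler sketch for a standard bivariate normal with correlation rho.
# Each full conditional is normal: theta_1 | theta_2 ~ N(rho*theta_2, 1-rho^2).
import random

random.seed(0)
rho = 0.8
sd = (1.0 - rho * rho) ** 0.5      # conditional standard deviation
t1, t2 = 0.0, 0.0                  # initial values
draws = []
for _ in range(50_000):
    t1 = random.gauss(rho * t2, sd)   # sample theta_1 | theta_2
    t2 = random.gauss(rho * t1, sd)   # sample theta_2 | theta_1 (updated t1)
    draws.append((t1, t2))

m1 = sum(d[0] for d in draws) / len(draws)              # should be near 0
corr_hat = sum(d[0] * d[1] for d in draws) / len(draws) # should be near rho
print(round(m1, 2), round(corr_hat, 2))
```

Note that each component draw conditions on the most recently updated values of the other components, exactly as in the algorithm above.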
Monte Carlo Integration
Let \(X \sim f_X\). Then for any \(f_X\)-integrable function \(h: \mathbb{R} \to \mathbb{R}\),
\[ E\big[h(X)\big] \;=\; \int h(x)\, f_X(x)\, dx \;\approx\; \frac{1}{n} \sum_{i=1}^{n} h(x_i) \]
for an iid observed sample \(x_1,\ldots,x_n\). Moreover, by the strong law of large numbers,
\[ \frac{1}{n} \sum_{i=1}^{n} h(x_i) \;\xrightarrow{\text{a.s.}}\; E\big[h(X)\big] \qquad \text{as } n \to \infty. \]
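A minimal sketch, with the illustrative choices \(X \sim \text{Uniform}(0,1)\) and \(h(x) = x^2\), so that \(E[h(X)] = \int_0^1 x^2\,dx = 1/3\):

```python
# Monte Carlo integration sketch: estimate E[h(X)] = 1/3 for X ~ Uniform(0,1)
# and h(x) = x^2 by the sample average of h over iid draws.
import random

random.seed(0)
n = 200_000
est = sum(random.random() ** 2 for _ in range(n)) / n
print(round(est, 3))
```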
Importance Sampling
Importance sampling is another strategy for numerical integration. Suppose it is of interest to compute an analytically intractable integral of some function \(g\). Then, for any density \(f_X\),
\[ \int g(x)\, dx \;=\; \int \frac{g(x)}{f_X(x)}\, f_X(x)\, dx \;=\; E\!\left[ \frac{g(X)}{f_X(X)} \right] \;\approx\; \frac{1}{n} \sum_{i=1}^{n} \frac{g(x_i)}{f_X(x_i)} \]
for an iid observed sample \(x_1,\ldots,x_n\) from \(f_X\), so long as \(\text{supp}(g) \subseteq \text{supp}(f_X)\).
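A minimal sketch with the illustrative choices \(g(x) = e^{-x^2}\) (so the true integral is \(\sqrt{\pi} \approx 1.7725\)) and \(f_X\) the standard normal density, whose support \(\mathbb{R}\) covers that of \(g\):

```python
# Importance sampling sketch: estimate the integral of g(x) = exp(-x^2)
# over R (true value sqrt(pi)) by averaging g(x_i) / f(x_i) over iid
# draws x_i from the standard normal density f.
import math
import random

random.seed(0)

def g(x):                           # integrand
    return math.exp(-x * x)

def f(x):                           # standard normal pdf (importance density)
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

n = 100_000
est = sum(g(x) / f(x) for x in (random.gauss(0.0, 1.0) for _ in range(n))) / n
print(round(est, 3))
```

The estimate is accurate here because the weight \(g(x)/f(x)\) is bounded; a poorly matched \(f_X\) with lighter tails than \(g\) would blow up the variance.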