ST 503 – Fundamentals of Linear Models and Regression

Linear Algebra Review

Linear algebra is the study of vector spaces, linear transformations, matrices, and inner product spaces.

Definition. A vector space $V$ over a field $F$ is a set of elements $v \in V$, with the operations of addition and scalar multiplication, that satisfy the following axioms.

$\forall\, x, y \in V,\quad x + y = y + x$
$\forall\, x, y, z \in V,\quad (x + y) + z = x + (y + z)$
There exists an element “$0$” $\in V$ s.t. $x + 0 = x$, $\forall\, x \in V$
$\forall\, x \in V\; \exists\, y \in V$ s.t. $x + y = 0$
There exists an element “$1$” $\in F$ s.t. $1 \cdot x = x$, $\forall\, x \in V$
$\forall\, a, b \in F,\; \forall\, x \in V,\quad (ab)x = a(bx)$
$\forall\, a \in F,\; \forall\, x, y \in V,\quad a(x + y) = ax + ay$
$\forall\, a, b \in F,\; \forall\, x \in V,\quad (a + b)x = ax + bx$ $\square$

In statistics contexts, typically $V = \mathbb{R}^n$ and $F = \mathbb{R}$.

The notion of a vector space is truly foundational for data science. A vector space gives us an entire framework and intuition for thinking about data objects (even as data are stored as objects in a computer).

Definition. A collection of vectors $x_1, \ldots, x_p$ is said to be linearly dependent if and only if there exist scalars $c_1, \ldots, c_p$, not all zero, such that

\sum_{i=1}^{p} c_i x_i = 0.

A set of vectors that is *not* linearly dependent is said to be linearly independent. $\square$

Often, to show that a set of vectors are linearly independent, a good strategy is to demonstrate that $c_1 x_1 + \cdots + c_p x_p = 0$ implies that $c_1 = \cdots = c_p = 0$.

Note that if $x_1, \ldots, x_p \in \mathbb{R}^n$, then at most $\min\{n, p\}$ of the $p$ vectors can be linearly independent.

Definition. A basis, $B$, for a vector space, $V$, is a linearly independent subset of $V$ with $\text{span}(B) = V.$ $\square$

Example. For $V = \mathbb{R}^n$, the set $\{e_1, \ldots, e_n\}$, where $e_i \in \mathbb{R}^n$ has $i$th component with value 1 and 0 for all other components, is a basis. This is commonly called the standard basis, and $e_i$ a standard basis vector. $\square$

Theorem. Let $V$ be a vector space and $B := \{u_1, \ldots, u_n\} \subseteq V$. Then $B$ is a basis for $V$ if and only if every $v \in V$ can be expressed uniquely as

v = a_1 u_1 + \cdots + a_n u_n

for some scalars $a_1, \ldots, a_n$. $\square$

Proof. Left as an exercise. $\square$

Theorem. If a vector space $V$ is generated by some finite set $S$, then some subset of $S$ is a basis for $V$. $\square$

Proof. Left as an exercise. $\square$

Definition. Let $V$ and $W$ be vector spaces over field $\mathbb{R}$. We call a function $T : V \to W$ a linear transformation from $V$ to $W$ if $\forall\, x, y \in V$ and $\forall\, c \in \mathbb{R}$,

$T(x + y) = T(x) + T(y)$
$T(cx) = cT(x).$ $\square$

Example. Let $A \in \mathbb{R}^{p \times q}$ and define $T(x) := Ax$ $\forall\, x \in \mathbb{R}^q$. Then $T : \mathbb{R}^q \to \mathbb{R}^p$ is a linear transformation since

$\forall\, x, y \in \mathbb{R}^q,\quad T(x + y) = A(x + y) = Ax + Ay = T(x) + T(y)$, and
$\forall\, x \in \mathbb{R}^q,\; \forall\, c \in \mathbb{R},\quad T(cx) = A \cdot (cx) = cAx = cT(x).$ $\square$

Example. Let $V = C(\mathbb{R})$, the vector space of continuous real-valued functions defined on $\mathbb{R}$. Then $\forall\, a, b \in \mathbb{R}$ with $a < b$, the transformation $T : V \to \mathbb{R}$ defined by,

T(f) := \int_a^b f(x)\, dx

is linear. $\square$

Definition. Let $V$ and $W$ be vector spaces, and $T : V \to W$ a linear transformation. The set,

\text{null}(T) := \{x \in V : T(x) = 0\},

is called the null space of $T$. The range of $T$ is defined as,

\text{range}(T) := \{y \in W : T(x) = y \text{ for some } x \in V\}.

Further, $\text{rank}(T) := \dim(\text{range}(T))$. $\square$

Theorem. (dimension theorem) Let $V$ and $W$ be vector spaces, and $T : V \to W$ be linear. If $V$ is finite-dimensional, then

\dim(\text{null}(T)) + \text{rank}(T) = \dim(V).

$\square$

Proof. See any linear algebra textbook. $\square$

Recall that if $T(x) := Ax$, then $\text{range}(T)$ is called the column space of $A$, denoted $\text{col}(A)$. Similarly, $\text{col}(A')$, the range of the map $x \mapsto A'x$, is called the row space of $A$ and is denoted by $\text{row}(A)$.

Definition. Let $A \in \mathbb{R}^{p \times q}$ and $B \in \mathbb{R}^{q \times m}$. Then the matrix product, denoted by $AB$, has components defined as

(AB)_{ij} := \sum_{k=1}^{q} A_{ik} B_{kj}

for $i \in \{1, \ldots, p\}$ and $j \in \{1, \ldots, m\}$. $\square$
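The componentwise definition translates directly into code. The following is a minimal sketch in plain Python; the helper name `matmul` and the small matrices are illustrative assumptions, not part of the notes.

```python
def matmul(A, B):
    """Matrix product from the definition (AB)_ij = sum_k A_ik * B_kj."""
    p, q, m = len(A), len(B), len(B[0])
    # The number of columns of A must match the number of rows of B.
    assert all(len(row) == q for row in A), "inner dimensions must agree"
    return [[sum(A[i][k] * B[k][j] for k in range(q)) for j in range(m)]
            for i in range(p)]

A = [[1, 2, 3],
     [0, 4, 5]]      # 2 x 3
B = [[1, 0],
     [0, 1],
     [2, 2]]         # 3 x 2
AB = matmul(A, B)    # 2 x 2
```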

Further notions for matrices.

The transpose of $A$, denoted $A'$, satisfies the property that $(A')_{ij} = A_{ji}$ for all indices $i, j$.
$A \in \mathbb{R}^{p \times q}$ is called square if $p = q$.
A square matrix is called symmetric if $A' = A$.
$A \in \mathbb{R}^{p \times p}$ is said to be invertible if $\exists\, B \in \mathbb{R}^{p \times p}$ s.t. $AB = I_p = BA$. In that case, $A^{-1} := B$.
For $A \in \mathbb{R}^{p \times p}$, the trace of $A$ is defined as $\displaystyle\text{tr}(A) := \sum_{i=1}^{p} A_{ii}$.

Selected properties of matrices.

$(AB)' = B'A'$
$A \in \mathbb{R}^{p \times p}$ is invertible if and only if $\text{rank}(A) = p$ (equivalently $\det(A) \neq 0$).
If both $A, B \in \mathbb{R}^{p \times p}$ are invertible, then $(AB)^{-1} = B^{-1}A^{-1}$.
For $A$ and $B$ of appropriate dimensions, $\text{tr}(AB) = \text{tr}(BA)$.

Definition. Let $A \in \mathbb{R}^{p \times p}$. A nonzero vector $v \in \mathbb{R}^p$ is called an eigenvector of $A$ if $Av = \lambda v$ for some scalar $\lambda$, called an eigenvalue. $\square$

Theorem. A scalar $\lambda$ is an eigenvalue of a matrix $A \in \mathbb{R}^{p \times p}$ if and only if $\det(A - \lambda I_p) = 0$. The polynomial $f(t) := \det(A - tI_p)$ is called the characteristic polynomial of $A$. $\square$

Proof. See any linear algebra textbook. $\square$

For example, if

A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix},

then $\forall\, v \in \mathbb{R}^3$, $Av = 1 \cdot v$. How about

A = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 0 & 0 & 6 \end{pmatrix}?

\begin{aligned} f(t) &= \det(A - tI_3) \\ &= \det \begin{pmatrix} 1-t & 2 & 3 \\ 0 & 4-t & 5 \\ 0 & 0 & 6-t \end{pmatrix} \\ &= (1-t)\det\begin{pmatrix} 4-t & 5 \\ 0 & 6-t \end{pmatrix} - 0 + 0 \\ &= (1-t)\left[(4-t)(6-t) - 0\right] \\ &= (1-t)(4-t)(6-t) \end{aligned}

which is a cubic polynomial with roots $t = 1$, $t = 4$, $t = 6$.
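As a quick numerical check of this calculation, one can evaluate $f(t) = \det(A - tI_3)$ at integer values of $t$ and see where it vanishes. The sketch below uses plain Python; the helper names `det3` and `char_poly` are hypothetical.

```python
def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

A = [[1, 2, 3],
     [0, 4, 5],
     [0, 0, 6]]

def char_poly(t):
    """Evaluate f(t) = det(A - t * I_3)."""
    shifted = [[A[r][c] - (t if r == c else 0) for c in range(3)] for r in range(3)]
    return det3(shifted)

# Scan a small grid of integers for roots of the characteristic polynomial.
roots = [t for t in range(0, 10) if char_poly(t) == 0]
```

Because $A$ is upper triangular, the roots found this way coincide with its diagonal entries.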

How to find the eigenvectors of

A = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 0 & 0 & 6 \end{pmatrix},

and verify that $1, 4, 6$ are the eigenvalues of $A$?

Observe that the eigenvectors must satisfy the system of linear equations

Av = \lambda v \quad \text{or} \quad (A - \lambda I)v = 0

\begin{aligned} (1 - \lambda)v_1 + 2v_2 + 3v_3 &= 0 \\ (4 - \lambda)v_2 + 5v_3 &= 0 \\ (6 - \lambda)v_3 &= 0 \end{aligned}

For $\lambda = 6$,

\left. \begin{aligned} -5v_1 + 2v_2 + 3v_3 &= 0 \\ -2v_2 + 5v_3 &= 0 \\ 0 &= 0 \end{aligned} \right\} \text{WLOG take } v_3 = 1

\begin{aligned} v_2 &= \tfrac{5}{2} \\ v_1 &= \tfrac{2}{5}v_2 + \tfrac{3}{5} = \tfrac{8}{5} \end{aligned}

To verify,

A \cdot v = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 0 & 0 & 6 \end{pmatrix} \begin{pmatrix} 8/5 \\ 5/2 \\ 1 \end{pmatrix} = \begin{pmatrix} 48/5 \\ 15 \\ 6 \end{pmatrix} = 6 \cdot \begin{pmatrix} 8/5 \\ 5/2 \\ 1 \end{pmatrix} = \lambda v

In general, for a system of linear equations in row echelon form,

\begin{pmatrix} M_{11} & M_{12} & \cdots & M_{1p} \\ 0 & M_{22} & \cdots & M_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & M_{pp} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{pmatrix}

Suppose $M_{ii} \neq 0$ $\forall\, i \in \{1, \ldots, p\}$. Then

\begin{aligned} x_p &= \frac{b_p}{M_{pp}} \\ &\;\;\vdots \\ x_2 &= \left(b_2 - \sum_{j=3}^{p} M_{2j} x_j\right) \frac{1}{M_{22}} \\ x_1 &= \left(b_1 - \sum_{j=2}^{p} M_{1j} x_j\right) \frac{1}{M_{11}} \end{aligned}

In general, this is called back substitution:

x_i = \left(b_i - \sum_{j=i+1}^{p} M_{ij} x_j\right) \frac{1}{M_{ii}} \qquad \text{for } i \in \{1, \ldots, p\}
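The recursion above can be sketched in a few lines of plain Python. The function name `back_substitute` and the right-hand side `b` are illustrative assumptions; the matrix is the upper triangular example used earlier.

```python
def back_substitute(M, b):
    """Solve M x = b for an upper triangular M with nonzero diagonal entries."""
    p = len(b)
    x = [0.0] * p
    # Work from the last equation up to the first.
    for i in range(p - 1, -1, -1):
        if M[i][i] == 0:
            raise ValueError("zero diagonal entry: back substitution does not apply")
        tail = sum(M[i][j] * x[j] for j in range(i + 1, p))
        x[i] = (b[i] - tail) / M[i][i]
    return x

M = [[1, 2, 3],
     [0, 4, 5],
     [0, 0, 6]]
b = [6, 9, 6]            # chosen so that the solution is (1, 1, 1)
x = back_substitute(M, b)
```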

Eigenvalues play an important role in matrix decompositions for certain classes of matrices. In particular, consider the following fundamental result for the class of symmetric matrices.

Spectral Theorem. For any symmetric matrix $A \in \mathbb{R}^{p \times p}$ there exists an orthogonal matrix $Q$ (i.e., $Q'Q = QQ' = I_p$) such that,

\[ A = QDQ', \]

where $D$ is a diagonal matrix composed of the eigenvalues of $A$. $\square$

Note also that the columns of $Q$ form an orthonormal basis for $\mathbb{R}^p$ consisting of eigenvectors of $A$.

The basis part means that every vector $v \in \mathbb{R}^p$ can be expressed as

v = \sum_{i=1}^{p} Q_i a_i,

for some $a_1, \ldots, a_p \in \mathbb{R}$, and that $Q_1, \ldots, Q_p$ are linearly independent.

Recall the definition of linear independence:

Vectors $u_1, \ldots, u_n \in \mathbb{R}^p$ are said to be linearly independent if

\sum_{i=1}^{n} u_i a_i = 0 \quad \text{implies that} \quad a_1 = \cdots = a_n = 0

A set of vectors is said to be linearly dependent if it is not linearly independent.

Orthonormal means $\forall\, i, j \in \{1, \ldots, p\}$

Q_i' Q_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}.

Next, verify that the columns $Q_1, \ldots, Q_p$ are eigenvectors of $A$ corresponding to eigenvalues given by the diagonal components of $D$. Take any $i \in \{1, \ldots, p\}$:

AQ_i = QDQ'Q_i = QD \begin{pmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{pmatrix} = Q \begin{pmatrix} 0 \\ \vdots \\ D_{ii} \\ \vdots \\ 0 \end{pmatrix} = D_{ii} \cdot Q_i

Thus, $Q_i$ is an eigenvector of $A$, by definition, and by the same definition $D_{ii}$ is an eigenvalue.

Further properties: $\displaystyle\text{tr}(A) = \sum_{i=1}^{p} \lambda_i$ and $\displaystyle\det(A) = \prod_{i=1}^{p} \lambda_i$.

\text{tr}(A) = \text{tr}(QDQ') = \text{tr}(Q'QD) = \text{tr}(D) = \sum_{i=1}^{p} D_{ii}

\det(A) = \det(QDQ') = \det(Q) \cdot \det(D) \cdot \det(Q') = \det(QQ') \cdot \det(D) = 1 \cdot \prod_{i=1}^{p} D_{ii}
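These identities are easy to check numerically. The sketch below assumes NumPy is available and uses `numpy.linalg.eigh`, which is intended for symmetric matrices; the matrix `A` is an arbitrary illustrative choice.

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])      # symmetric, illustrative only

lam, Q = np.linalg.eigh(A)           # eigenvalues and orthonormal eigenvectors (columns of Q)
D = np.diag(lam)

ok_orthogonal = np.allclose(Q.T @ Q, np.eye(3))       # Q'Q = I
ok_decomp = np.allclose(Q @ D @ Q.T, A)               # A = Q D Q'
ok_eigvec = np.allclose(A @ Q, Q @ D)                 # A Q_i = D_ii Q_i, column by column
ok_trace = np.isclose(np.trace(A), lam.sum())         # tr(A) = sum of eigenvalues
ok_det = np.isclose(np.linalg.det(A), lam.prod())     # det(A) = product of eigenvalues
```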

What if a matrix is neither symmetric nor even square? How can it be decomposed?

Let $X \in \mathbb{R}^{n \times p}$. Then $X'X \in \mathbb{R}^{p \times p}$ and $XX' \in \mathbb{R}^{n \times n}$, and both are symmetric:

\begin{aligned} (X'X)' &= X'(X')' = X'X \\ (XX')' &= (X')'\, X' = XX' \end{aligned}

Accordingly, we can apply the Spectral theorem to both $X'X$ and $XX'$ to decompose as

X'X = V\Delta_1 V' \qquad \text{and} \qquad XX' = U\Delta_2 U'

for matrices $V \in \mathbb{R}^{p \times p}$ and $U \in \mathbb{R}^{n \times n}$, both orthogonal matrices, and diagonal matrices $\Delta_1$ and $\Delta_2$.

Moreover, it can be shown that $X = U\Sigma V'$, where $\Sigma$ is a diagonal matrix with nonzero diagonal components equal to the square roots of the nonzero diagonal components of both $\Delta_1$ and $\Delta_2$. The diagonal components of $\Sigma$ are called the singular values of $X$. For symmetric matrices $A \in \mathbb{R}^{p \times p}$ with singular values $\sigma_1, \ldots, \sigma_p$ and eigenvalues $\lambda_1, \ldots, \lambda_p$,

\sigma_i = |\lambda_i| \quad \text{for } i \in \{1, \ldots, p\}

SVD and other decompositions are important for data compression and dimension reduction. Consider this for $X \in \mathbb{R}^{n \times p}$ with $\text{rank}(X) = r$,

\begin{aligned} X &= U\Sigma V' = (U_1, \ldots, U_n) \begin{pmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & 0 \\ & & & & \ddots \\ & & & & & 0 \end{pmatrix} \begin{pmatrix} V_1' \\ \vdots \\ V_p' \end{pmatrix} \\[1.5em] &= (U_1, \ldots, U_r) \begin{pmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{pmatrix} \begin{pmatrix} V_1' \\ \vdots \\ V_r' \end{pmatrix} \quad \text{“compact SVD”} \end{aligned}

where, in the compact SVD, $U_1, \ldots, U_r \in \mathbb{R}^n$ and $V_1, \ldots, V_r \in \mathbb{R}^p$. Storing raw $X$ requires $n \cdot p$ pieces of information. Storing the compact SVD requires

n \cdot r + r + p \cdot r = (n + p + 1) \cdot r

pieces of information. For example, suppose $n = 100$, $p = 30$. Then

(n + p + 1) \cdot r = 131 \cdot r < 3000 = n \cdot p \quad \text{if } r < 23.

Furthermore, if $\sigma_{k+1}, \ldots, \sigma_r \approx 0$, for some $k < r$, then

X \approx (U_1, \ldots, U_k) \begin{pmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_k \end{pmatrix} \begin{pmatrix} V_1' \\ \vdots \\ V_k' \end{pmatrix}

which requires even less memory.
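The storage comparison can be illustrated with a small experiment. This is a sketch under stated assumptions: NumPy is available, the matrix is simulated with rank $r = 5$ by construction, and the tolerance used to decide which singular values count as zero is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 100, 30, 5
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, p))    # rank r by construction

U, s, Vt = np.linalg.svd(X, full_matrices=False)         # s is sorted in descending order
k = int(np.sum(s > 1e-10 * s[0]))                         # numerical rank (assumed tolerance)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]               # compact SVD reconstruction

stored_raw = n * p                                        # entries needed for raw X
stored_svd = (n + p + 1) * k                              # entries needed for the compact SVD
```

Here `X_k` reproduces `X` up to floating point error while storing far fewer numbers. Choosing `k` smaller than the rank would give an approximation rather than an exact reconstruction.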

Least Squares Problems

Suppose that $X \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^n$. Consider the conditions in which the system

\[ Xb = y \]

has a solution for some vector $b \in \mathbb{R}^p$, that is, the conditions under which $y \in \text{col}(X)$. As we have seen, this can be determined as follows:

Apply Gaussian elimination to express the system in row echelon form.
Apply the back substitution algorithm.

In the special case that $p = n$ and $\text{rank}(X) = n$, the unique solution is $b = X^{-1}y$.

How about the system when $p < n$?

$X$ is tall and narrow (i.e., more rows than columns)
Likely, $y \notin \text{col}(X)$
This is the context of least squares problems

If the system $Xb = y$ does not have a solution then perhaps the next best thing would be to find the $b \in \mathbb{R}^p$ which makes $Xb$ as “close” as possible to $y$.

Q(b) := \|y - Xb\|^2.

The $\underset{b}{\text{argmin}}\{Q(b)\}$ is called the least squares solution.

Desirable properties of $Q(b)$:

Convex
Differentiable
$\underset{b}{\text{argmin}}\{Q(b)\} = \{b : \nabla_b Q(b) = 0\}$

Recall that for a function $f : \mathbb{R}^p \to \mathbb{R}$,

\nabla_x f(x) := \begin{pmatrix} \dfrac{\partial f(x)}{\partial x_1} \\[6pt] \vdots \\[2pt] \dfrac{\partial f(x)}{\partial x_p} \end{pmatrix} \in \mathbb{R}^p.

Lemma. For $a, b \in \mathbb{R}^p$ and $A \in \mathbb{R}^{p \times p}$, the following properties hold.

$\nabla_b(a'b) = a$
$\nabla_b(b'Ab) = (A + A')b$ $\square$

Proof. See any multivariate calculus textbook. $\square$

Observing the fact that

Q(\beta) = (y - X\beta)'(y - X\beta) = y'y - y'X\beta - \beta'X'y + \beta'X'X\beta,

it follows by the lemma that

\nabla_\beta Q(\beta) = -2X'y + (X'X + X'X)\beta = -2X'(y - X\beta).

Setting $\nabla_\beta Q(\beta) = 0$ yields the “normal equations”

X'X\beta = X'y.

Example. Recall a simple linear regression,

y_i = \beta_0 + \beta_1 x_i + u_i \qquad \text{for } i \in \{1, \ldots, n\}.

In this case,

y = X\beta + u, \qquad \text{where} \quad X := \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}.

Then

X'X\beta = \begin{pmatrix} n & \displaystyle\sum_i x_i \\[6pt] \displaystyle\sum_i x_i & \displaystyle\sum_i x_i^2 \end{pmatrix} \beta \qquad \text{and} \qquad X'y = \begin{pmatrix} \displaystyle\sum_i y_i \\[6pt] \displaystyle\sum_i x_i y_i \end{pmatrix}.

Accordingly, the normal equations are

\begin{aligned} n\beta_0 + n\bar{x}_n \beta_1 &= n\bar{y}_n \\ n\bar{x}_n \beta_0 + \sum_i x_i^2\, \beta_1 &= \sum_i x_i y_i, \end{aligned}

and so the least squares solutions for $\beta_0$ and $\beta_1$ are

\begin{aligned} \hat{\beta}_0 &= \bar{y}_n - \bar{x}_n \hat{\beta}_1 \\[1em] \hat{\beta}_1 &= \frac{\displaystyle\sum_i y_i(x_i - \bar{x}_n)}{\displaystyle\sum_i (x_i - \bar{x}_n)^2}. \end{aligned}

$\square$
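The closed-form solution for simple linear regression can be sketched directly from these formulas. The function name `simple_ols` and the toy data (which lie exactly on the line $y = 1 + 2x$) are illustrative assumptions.

```python
def simple_ols(x, y):
    """Least squares intercept and slope for y_i = b0 + b1 * x_i + u_i."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum(yi * (xi - xbar) for xi, yi in zip(x, y))   # sum_i y_i (x_i - xbar)
    sxx = sum((xi - xbar) ** 2 for xi in x)               # sum_i (x_i - xbar)^2
    b1 = sxy / sxx
    b0 = ybar - xbar * b1
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = simple_ols(x, y)

# The fitted coefficients should satisfy both normal equations.
eq1 = len(x) * b0 + sum(x) * b1 - sum(y)
eq2 = sum(x) * b0 + sum(xi ** 2 for xi in x) * b1 - sum(xi * yi for xi, yi in zip(x, y))
```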

Moreover, it can be mathematically proven that there always exists a solution to the normal equations. In particular, when $\text{rank}(X) = p$, the unique least squares solution has the form

\hat{b} = (X'X)^{-1} X'y.

This solution turns out to be the coefficients corresponding to the orthogonal projection of $y$ onto $\text{col}(X)$:

\hat{y} = X\hat{b} = \underbrace{X(X'X)^{-1}X'}_{=:\, \mathcal{P}_X} y

A linear transformation $\mathcal{P} : \mathbb{R}^n \to \mathbb{R}^n$ is said to be a projection onto some subspace $V \subseteq \mathbb{R}^n$ if

$\mathcal{P}v \in V$ for every $v \in \mathbb{R}^n$
$\mathcal{P}v = v$ for every $v \in V.$

Further, $\mathcal{P}$ is called an orthogonal projection onto $V$ if it also satisfies

$(\mathcal{P}v)'(u - \mathcal{P}u) = (v - \mathcal{P}v)'\mathcal{P}u = 0$ $\forall\, v, u \in \mathbb{R}^n$.

Observe that for a projection matrix $\mathcal{P}$ and any $v \in \mathbb{R}^n$,

\mathcal{P}^2 v = \mathcal{P}\mathcal{P}v = \mathcal{P}v \qquad \text{since } \mathcal{P}v \in V,

so $\mathcal{P}$ is “idempotent”. If $\mathcal{P}$ is in addition an orthogonal projection, then taking $u = v$ in the definition gives

$(\mathcal{P}v)'(v - \mathcal{P}v) = 0.$

In fact, any idempotent matrix is a projection matrix, and any symmetric idempotent matrix is an orthogonal projection matrix.

Verify that $\mathcal{P}_X$ is an orthogonal projection matrix onto $\text{col}(X)$.

Suppose $v \in \mathbb{R}^n$. Then
$\mathcal{P}_X v = X(X'X)^{-1}X'v \in \text{col}(X)$
Suppose $v \in \text{col}(X)$. Then $v = Xa$ for some $a \in \mathbb{R}^p$, and so
$\mathcal{P}_X v = X(X'X)^{-1}X'v = X(X'X)^{-1}X'Xa = Xa = v$
Let $v, u \in \mathbb{R}^n$. Then
$(\mathcal{P}_X v)'(u - \mathcal{P}_X u) = v'(\mathcal{P}_X u - \mathcal{P}_X \mathcal{P}_X u) = 0$
$(v - \mathcal{P}_X v)'\mathcal{P}_X u = (\mathcal{P}_X' v - \mathcal{P}_X' \mathcal{P}_X v)'\, u = 0$
since $\mathcal{P}_X' = \bigl(X(X'X)^{-1}X'\bigr)' = X(X'X)^{-1}X' = \mathcal{P}_X$
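The three properties verified above can also be checked numerically for a particular design matrix. The following sketch assumes NumPy is available; the matrix `X` and the vector `y` are arbitrary illustrative choices with $\text{rank}(X) = p$.

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])                      # full column rank
P = X @ np.linalg.solve(X.T @ X, X.T)           # X (X'X)^{-1} X' without forming the inverse

is_idempotent = np.allclose(P @ P, P)
is_symmetric = np.allclose(P, P.T)
fixes_col_space = np.allclose(P @ X, X)         # P v = v for every v in col(X)

y = np.array([1.0, 0.0, 2.0, 5.0])
resid = y - P @ y                               # the part of y outside col(X)
resid_orthogonal = np.allclose(X.T @ resid, 0.0)
```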

Example. Let $X = \mathbf{1}_n$. Then $\text{col}(X)$ is the subspace of vectors that have all components equal to the same value. That being true, projecting onto $\text{col}(X)$ is a projection of a vector to a single value.

\mathcal{P}_X = X(X'X)^{-1}X' = \tfrac{1}{n}\, XX' = \tfrac{1}{n} \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix}

Then for any $y \in \mathbb{R}^n$,

\mathcal{P}_X y = \tfrac{1}{n} \begin{pmatrix} \displaystyle\sum_i y_i \\ \vdots \\ \displaystyle\sum_i y_i \end{pmatrix} = \bar{y}_n \cdot \mathbf{1}_n

$\square$

What happens if $\text{rank}(X) < p$? In this case $(X'X)^{-1}$ does not exist.

Consider the orthogonal diagonalization $XX' = UDU' \in \mathbb{R}^{n \times n}$ and use the fact that $\text{col}(X) = \text{col}(\widetilde{U})$, where if $k := \text{rank}(X)$ then

\widetilde{U} := (U_1, \ldots, U_k)

and $U_i$ for $i \in \{1, \ldots, k\}$ corresponds to the $i$th nonzero eigenvalue of $XX'$. Then

\mathcal{P}_X = \widetilde{U}(\widetilde{U}'\widetilde{U})^{-1}\widetilde{U}' = \widetilde{U}\widetilde{U}'.
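A minimal numerical sketch of this construction follows, assuming NumPy is available. The rank-deficient design matrix and the tolerance used to separate nonzero from zero eigenvalues are illustrative assumptions.

```python
import numpy as np

X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])                 # rank 2 < p = 3

evals, U = np.linalg.eigh(X @ X.T)              # orthogonal diagonalization of XX'
keep = evals > 1e-10 * evals.max()              # eigenvalues treated as nonzero
U_tilde = U[:, keep]                            # eigenvectors spanning col(X)
P = U_tilde @ U_tilde.T                         # projection onto col(X)

k = int(keep.sum())                             # numerical rank of X
```

The same matrix is obtained from `X @ np.linalg.pinv(X)`, which is another standard way to form the orthogonal projection onto the column space.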

Gram-Schmidt Orthonormalization

Here we study a procedure for constructing a set of mutually orthogonal vectors from a set of linearly independent vectors. Suppose that $x_1, \ldots, x_p \in \mathbb{R}^n$ are linearly independent. The result of the Gram-Schmidt procedure is a set of mutually orthogonal vectors $u_1, \ldots, u_p \in \mathbb{R}^n$, which become orthonormal once each is scaled to unit length, such that

\text{span}\{u_1, \ldots, u_p\} = \text{span}\{x_1, \ldots, x_p\}.

In matrix form with $U := (u_1, \ldots, u_p)$ and $X := (x_1, \ldots, x_p)$, that is

\text{col}(U) = \text{col}(X).

How to construct $u_1$?

Take $u_1 := x_1$

Next, for $u_2$, modify $x_2$ so that it is orthogonal to $u_1$. Such a vector is in the orthogonal complement of the span of $u_1$,

Take $u_2 := (I_n - \mathcal{P}_{u_1})\, x_2$

Verify that

u_2' u_1 = x_2'(I_n - \mathcal{P}_{u_1})\, u_1 = 0.

Continuing on, the third vector must be orthogonal to both $u_1$ and $u_2$. That is,

Take $u_3 := (I_n - \mathcal{P}_{u_1} - \mathcal{P}_{u_2})\, x_3$

so that

\begin{aligned} u_3' u_2 &= x_3'(I_n - \mathcal{P}_{u_1} - \mathcal{P}_{u_2})\, u_2 \\ &= -x_3' \mathcal{P}_{u_1} u_2 \\ &= -x_3' u_1 (u_1' u_1)^{-1} u_1' u_2 \\ &= 0, \end{aligned}

and

\begin{aligned} u_3' u_1 &= x_3'(I_n - \mathcal{P}_{u_1} - \mathcal{P}_{u_2})\, u_1 \\ &= -x_3' \mathcal{P}_{u_2} u_1 \\ &= -x_3' u_2 (u_2' u_2)^{-1} u_2' u_1 \\ &= 0. \end{aligned}

Accordingly,

\begin{aligned} u_1 &= x_1 \\[.5em] u_2 &= x_2 - \frac{u_1 u_1' x_2}{\|u_1\|^2} \\[.5em] u_3 &= x_3 - \frac{u_1 u_1' x_3}{\|u_1\|^2} - \frac{u_2 u_2' x_3}{\|u_2\|^2} \\ &\vdots \end{aligned}

More concisely, for $j \in \{1, \ldots, p\}$,

u_j = x_j - \left(\sum_{k=1}^{j-1} \frac{u_k u_k'}{\|u_k\|^2}\right) x_j.

So now we have an orthogonal set $\{u_1, \ldots, u_p\}$. How to show that,

\text{span}\{u_1, \ldots, u_p\} = \text{span}\{x_1, \ldots, x_p\}\text{ ?}

Since this is equivalent to $\text{col}(U) = \text{col}(X)$, express

X = (u_1, \ldots, u_p) \cdot \underbrace{ \begin{pmatrix} 1 & \dfrac{u_1' x_2}{\|u_1\|^2} & \dfrac{u_1' x_3}{\|u_1\|^2} & \dfrac{u_1' x_4}{\|u_1\|^2} & \cdots & \dfrac{u_1' x_p}{\|u_1\|^2} \\[10pt] 0 & 1 & \dfrac{u_2' x_3}{\|u_2\|^2} & \dfrac{u_2' x_4}{\|u_2\|^2} & \cdots & \dfrac{u_2' x_p}{\|u_2\|^2} \\[10pt] 0 & 0 & 1 & \dfrac{u_3' x_4}{\|u_3\|^2} & \cdots & \dfrac{u_3' x_p}{\|u_3\|^2} \\[10pt] 0 & 0 & 0 & 1 & \cdots & \dfrac{u_4' x_p}{\|u_4\|^2} \\[6pt] \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\[4pt] 0 & 0 & 0 & 0 & \cdots & 1 \end{pmatrix} }_{=:\, S}

= \underbrace{\left(\frac{u_1}{\|u_1\|}, \ldots, \frac{u_p}{\|u_p\|}\right)}_{=:\, Q} \cdot \underbrace{\begin{pmatrix} \|u_1\| & & \\ & \ddots & \\ & & \|u_p\| \end{pmatrix} \cdot S}_{=:\, R}

Note that $X = QR$ is commonly called the “QR decomposition”, where $Q \in \mathbb{R}^{n \times p}$ has orthonormal columns (so $Q'Q = I_p$; $Q$ is an orthogonal matrix when $n = p$) and $R$ is an upper triangular matrix. This expression also yields the Cholesky decomposition of $X'X$,

\[ X'X = R'Q'QR = R'R, \]

where $R$ is upper triangular. Think about the uniqueness of these decompositions.
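The procedure and the resulting decompositions can be sketched as follows, assuming NumPy is available. This implements the classical Gram-Schmidt recursion exactly as written above (it is not the numerically more stable variant used by library QR routines); the function name `gram_schmidt` and the matrix `X` are illustrative assumptions.

```python
import numpy as np

def gram_schmidt(X):
    """Classical Gram-Schmidt on the columns of X (assumed linearly independent)."""
    n, p = X.shape
    U = np.zeros((n, p))
    for j in range(p):
        u = X[:, j].copy()
        # Subtract the projection of x_j onto each previously constructed u_k.
        for k in range(j):
            u -= U[:, k] * (U[:, k] @ X[:, j]) / (U[:, k] @ U[:, k])
        U[:, j] = u
    return U

X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]])
U = gram_schmidt(X)
Q = U / np.linalg.norm(U, axis=0)      # scale each column to unit length
R = Q.T @ X                            # upper triangular factor
```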

The General Linear Model

Suppose $x_1, \ldots, x_p \in \mathbb{R}$ are covariates/features with a meaningful interpretation in some setting such that a random variable $Y$ is generated as

Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + U

for some coefficients $\beta_0, \beta_1, \ldots, \beta_p \in \mathbb{R}$ and some random variable $U$ with $E(U) = 0$. This is regarded as the general linear model, and repeated sampling from it leads to a sample of size $n$, where

\begin{aligned} Y_1 &= \beta_0 + \beta_1 x_{1,1} + \cdots + \beta_p x_{p,1} + U_1 \\[.3em] &\quad\vdots \\[.3em] Y_n &= \beta_0 + \beta_1 x_{1,n} + \cdots + \beta_p x_{p,n} + U_n \end{aligned}

Equivalently,

Y = X\beta + U

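where $Y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, $\beta \in \mathbb{R}^p$, and $U \in \mathbb{R}^n$. Here, with a slight abuse of notation, $p$ counts every column of $X$, including the column of ones that carries the intercept $\beta_0$. Note that only the vector $Y$ is commonly understood as the data. The matrix $X$ is referred to as the design matrix, and $\beta$ is regarded as fixed but unknown.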

The statistical inference problem is to estimate the unknown parameters $\beta$. A natural candidate for an estimator of $\beta$ is the least squares solution in its general form,

\underset{\beta}{\operatorname{argmin}}\{Q(\beta)\} = \{\beta : X'X\beta = X'y\}

If we propose this estimator, then we need to study its sampling distribution. For example, if $U \sim N_n(0, \sigma^2 I_n)$, then

\hat{\beta} \sim N_p(\beta,\, \sigma^2(X'X)^{-1})

if $\text{rank}(X) = p$.

Next,

\hat{y} := X\hat{\beta} \in \text{col}(X).

If we decompose

y = X\hat{\beta} + \underbrace{y - X\hat{\beta}}_{=:\, \hat{e}},

it turns out that $\hat{e} \in \text{null}(X')$ since

X'\hat{e} = X'y - X'X\hat{\beta} = 0.

Moreover, note that this decomposition is unique because $\text{col}(X) \perp \text{null}(X')$, and

\begin{aligned} \|y\|^2 &= (X\hat{\beta} + \hat{e})'(X\hat{\beta} + \hat{e}) \\ &= \|X\hat{\beta}\|^2 + 2\hat{\beta}'X'\hat{e} + \|\hat{e}\|^2 \\ &= \|\hat{y}\|^2 + \|\hat{e}\|^2. \end{aligned}

The uniqueness of this decomposition/Pythagorean theorem implies that for the orthogonal projection matrix $P$ onto $\text{col}(X)$,

y = Py + (I - P)y = X\hat{\beta} + \hat{e}

with $Py = X\hat{\beta}$ and $(I - P)y = \hat{e}$.

Goals:

Determine whether certain functions of parameters are “estimable”.
Construct “unbiased estimators” for the “estimable” functions.

Definition. An estimator $t(y)$ is an unbiased estimator for the scalar $\lambda'\beta$ if and only if $E(t(Y)) = \lambda'\beta$, $\forall\, \beta$. $\square$

Definition. An estimator $t(y)$ is a linear estimator in $y$ if and only if $t(y) = c + a'y$, for some constants $a$ and $c$. $\square$

Definition. A function $\lambda'\beta$ is linearly estimable if and only if there exists a linear unbiased estimator of it. If no such estimator exists, then the function is called nonestimable. $\square$

Theorem. $\lambda'\beta$ is linearly estimable if and only if there exists a vector $a$ such that $E(a'Y) = \lambda'\beta$, $\forall\, \beta$. $\square$

Proof. First suppose that $\exists\, a \in \mathbb{R}^n$ such that $E(a'Y) = \lambda'\beta$, $\forall\, \beta \in \mathbb{R}^p$. Then $t(y) = a'y$ is an unbiased linear estimator of $\lambda'\beta$, $\forall\, \beta$, with $c = 0$. Conversely, if $\lambda'\beta$ is linearly estimable, then there exist constants $a$ and $c$ such that $t(y) = c + a'y$, and $\forall\, \beta$,

\lambda'\beta = E(t(Y)) = c + a'E(Y) = c + a'X\beta.

Choosing $\beta = 0$ demonstrates that $c = 0$, so that $\forall\, \beta \in \mathbb{R}^p$, $t(y) = a'y$ is a linear unbiased estimator of $\lambda'\beta$. $\square$

In words, $\lambda'\beta$ is linearly estimable if and only if it is equal to the expected value of a linear combination of the data. Observe the condition that

E(a'Y) = \lambda'\beta, \quad \forall\, \beta

implies that $\lambda = X'a$, because $E(a'Y) = a'X\beta$ must equal $\lambda'\beta$ for every $\beta$. So essentially, linear estimability of a function $\lambda'\beta$ is equivalent to the condition that $\lambda \in \text{col}(X')$.

Example. Suppose $Y = \mathbf{1}_n\beta + U$. Then for $\lambda = 1$, $\lambda'\beta = \beta$. Choosing $a = e_1 \in \mathbb{R}^n$,

E(a'Y) = a'\mathbf{1}_n\beta = \beta,

$\forall\, \beta \in \mathbb{R}$. That being true, $t(y) = y_1$ is a linear unbiased estimator of $\beta$. Probably $y_1$ is not a great estimator of $\beta$ due to high variability, and the discard of the information $y_2, \ldots, y_n$, but the theorem only gives the necessary and sufficient conditions for the existence of a linear estimator. A better choice would be $a = \tfrac{1}{n} \cdot \mathbf{1}_n$. Then $t(y) = a'y = \bar{y}_n$, and

E(\bar{y}_n) = E(a'Y) = \tfrac{1}{n}\mathbf{1}_n'\mathbf{1}_n\beta = \beta.

$\square$

Example. Suppose

Y = \begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 0 \\ 1 & 0 \end{pmatrix} \underbrace{\begin{pmatrix} \mu \\ \alpha \end{pmatrix}}_{=:\, \beta} + U

To estimate $\alpha$, choose $\lambda = (0, 1)'$ since $\lambda'\beta = \alpha$. For $\alpha$ to be estimable, it must be the case that $\lambda \in \text{col}(X')$, so determine if there exists a solution to

\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.

One solution is $a = (1, 0, -1, 0)'$, and so $t(y) = y_1 - y_3$ is a linear unbiased estimator of $\alpha$.

$\square$

Example. Now let’s consider an example where $\lambda'\beta$ is not linearly estimable.

Y = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix} \underbrace{\begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \end{pmatrix}}_{=:\, \beta} + U

In this case, $\lambda = (0, 1, 0)'$ to estimate $\alpha_1$, but

\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}

does not have a solution. Thus, there exists no linear unbiased estimator for $\alpha_1$ because $\lambda \neq X'a$ for any $a \in \mathbb{R}^n$.

$\square$
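The check $\lambda \in \text{col}(X')$ in the last two examples can be automated. The sketch below assumes NumPy is available and solves $X'a = \lambda$ in the least squares sense, declaring $\lambda'\beta$ estimable when the residual vanishes. The helper name `is_estimable`, the tolerance, and the two extra choices of $\lambda$ are illustrative assumptions.

```python
import numpy as np

def is_estimable(X, lam, tol=1e-10):
    """Return True if lam lies in col(X'), i.e. X'a = lam has an exact solution."""
    a, *_ = np.linalg.lstsq(X.T, lam, rcond=None)
    return bool(np.linalg.norm(X.T @ a - lam) < tol)

X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])

alpha1 = is_estimable(X, np.array([0.0, 1.0, 0.0]))       # alpha_1 alone
contrast = is_estimable(X, np.array([0.0, 1.0, -1.0]))    # alpha_1 - alpha_2
mean1 = is_estimable(X, np.array([1.0, 1.0, 0.0]))        # mu + alpha_1
```

In this design $\alpha_1$ by itself is not estimable, while the contrast $\alpha_1 - \alpha_2$ and the group mean $\mu + \alpha_1$ are.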

Notice that the set of vectors $\lambda$ for which $\lambda'\beta$ is linearly estimable actually forms an entire subspace,

\{\lambda \in \mathbb{R}^p : \lambda = X'a \text{ for some } a \in \mathbb{R}^n\} = \text{col}(X').

Moreover, if $\text{rank}(X') = p$, then $\text{col}(X') = \mathbb{R}^p$ and so $\lambda'\beta$ is linearly estimable $\forall\, \lambda \in \mathbb{R}^p$. An important corollary of this discussion is that each component of $\beta$ is linearly estimable (i.e., identifiable) if $X$ has full column rank.

With the notion of linear estimability now established, what are some effective strategies for determining if a function $\lambda'\beta$ is linearly estimable?

Method 1. If $\lambda \in \text{col}(X')$, then $\lambda'\beta$ is estimable. Accordingly, for any basis $\{v_1, \ldots, v_k\}$ for $\text{col}(X')$ determine if the system $v_1 a_1 + \cdots + v_k a_k = \lambda$ has a solution. Equivalently, since $\mathcal{P}_{X'} = X'(XX')^{-1}X$ projects onto $\text{col}(X')$ (with the inverse replaced by a generalized inverse when $XX'$ is singular), it suffices to show that $\mathcal{P}_{X'}\lambda = \lambda$.

Method 2. Since $\text{col}(X') \oplus \text{null}(X) = \mathbb{R}^p$, $\lambda \in \text{col}(X')$ if and only if $\lambda \perp \text{null}(X)$. That being so, with $k := \text{rank}(X)$, construct a basis $\{w_1, \ldots, w_{p-k}\}$ for $\text{null}(X)$, and show that

\lambda' w_j = 0 \quad \forall\, j \in \{1, \ldots, p-k\}.

Method 3. If $\lambda'\beta$ can be expressed as a linear combination of $E(Y_1), \ldots, E(Y_n)$, then $\lambda'\beta$ is estimable.

Theorem. If $\lambda'\beta$ is estimable, then the least squares estimator $\lambda'\hat{\beta}$ is the same for all solutions to the normal equations. $\square$

Proof. Let $\hat{\beta}_1, \hat{\beta}_2 \in \{X'X\beta = X'y\}$. Then

\begin{aligned} X'X\hat{\beta}_1 &= X'X\hat{\beta}_2 \\ X'X(\hat{\beta}_1 - \hat{\beta}_2) &= 0, \end{aligned}

and so $\hat{\beta}_1 - \hat{\beta}_2 \in \text{null}(X'X) = \text{null}(X)$. For estimable $\lambda'\beta$, $\lambda \perp \text{null}(X)$ so it follows that $\lambda'(\hat{\beta}_1 - \hat{\beta}_2) = 0$. $\square$
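To see the theorem in action, the sketch below builds two different solutions of the normal equations for a rank-deficient design and compares $\lambda'\hat{\beta}$ across them. It assumes NumPy is available; the data `y`, the null space vector `w`, and the multiplier `2.5` are illustrative assumptions.

```python
import numpy as np

X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])
y = np.array([3.0, 5.0, 8.0, 10.0])

beta1 = np.linalg.pinv(X) @ y            # one solution of the normal equations
w = np.array([1.0, -1.0, -1.0])          # X w = 0, so w spans null(X)
beta2 = beta1 + 2.5 * w                  # a second, different solution

lam_est = np.array([0.0, 1.0, -1.0])     # alpha_1 - alpha_2: estimable
lam_non = np.array([0.0, 1.0, 0.0])      # alpha_1: not estimable

same_for_estimable = np.isclose(lam_est @ beta1, lam_est @ beta2)
differs_for_nonestimable = not np.isclose(lam_non @ beta1, lam_non @ beta2)
```

For the estimable contrast both solutions give the difference of the two group means, whereas the value assigned to the nonestimable $\alpha_1$ depends on which solution is picked.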

Theorem. The least squares estimator $\lambda'\hat{\beta}$ of an estimable function $\lambda'\beta$ is a linear unbiased estimator of $\lambda'\beta$. $\square$

Proof. Since $\lambda'\beta$ is estimable, $\lambda = X'a$ for some $a \in \mathbb{R}^n$, and so

\lambda'\hat{\beta} = a'X(X'X)^{-1}X'y = a'\mathcal{P}_X y.

Hence, $\lambda'\hat{\beta} = v'y$ for some $v \in \mathbb{R}^n$. Next,

E(\lambda'\hat{\beta}) = a'\mathcal{P}_X E(y) = a'\mathcal{P}_X X\beta = a'X\beta = \lambda'\beta.

$\square$

Gauss-Markov Model

The motivation here is to understand additional inferences that ensue from making assumptions on the variance of the errors in the general linear model. We begin with the Gauss-Markov assumptions/model,

Y = X\beta + U,

with $E(U) = 0 \in \mathbb{R}^n$, $\text{Var}(U) = \sigma^2 I_n$, and $\sigma \in \mathbb{R}_+$. In particular, $E(U_i) = 0$ for $i \in \{1, \ldots, n\}$, and the errors are homoskedastic and uncorrelated; that is,

\text{Cov}(U_i, U_j) = \begin{cases} \sigma^2 & \text{for } i = j \\ 0 & \text{for } i \neq j \end{cases}.

Example. Suppose that $\lambda'\beta$ is an estimable function. Then for $\hat{\beta} \in \{X'X\beta = X'y\}$,

\begin{aligned} \text{Var}(\lambda'\hat{\beta}) &= \lambda'\,\text{Var}\!\left[(X'X)^{-1}X'Y\right]\lambda \\ &= \lambda'(X'X)^{-1}X'\,\text{Var}(Y)\,X(X'X)^{-1}\lambda \\ &= a'X(X'X)^{-1}X'\,\text{Var}(Y)\,X(X'X)^{-1}X'a, \quad \text{for some } a, \text{ since } \lambda \in \text{col}(X') \\ &= \sigma^2 \cdot a'X(X'X)^{-1}X'X(X'X)^{-1}X'a \\ &= \sigma^2 \cdot a'X(X'X)^{-1}X'a \\ &= \sigma^2\,\lambda'(X'X)^{-1}\lambda. \end{aligned}

$\square$

It turns out that the least squares estimate plays an important role under the Gauss-Markov assumptions, as described by the following result.

Theorem. (Gauss-Markov theorem) Under the Gauss-Markov assumptions, if $\lambda'\beta$ is estimable, then $\lambda'\hat{\beta}$ is the best (minimum variance) linear unbiased estimator (BLUE) of $\lambda'\beta$, where $\hat{\beta} \in \{X'X\beta = X'y\}$. $\square$

Proof. Assume that $c + d'y$ is another linear unbiased estimator of $\lambda'\beta$. Then, $\forall\, \beta \in \mathbb{R}^p$,

\lambda'\beta = E(c + d'Y) = c + d'X\beta.

Choosing $\beta = 0$ demonstrates that $c = 0$. Accordingly, $\lambda = X'd$, and

\begin{aligned} \text{Var}(c + d'Y) &= \text{Var}(d'Y) \\ &= \text{Var}(\lambda'\hat{\beta} + d'Y - \lambda'\hat{\beta}) \\ &= \text{Var}(\lambda'\hat{\beta}) + 2\,\text{Cov}(\lambda'\hat{\beta},\, d'Y - \lambda'\hat{\beta}) + \text{Var}(d'Y - \lambda'\hat{\beta}). \end{aligned}

Next,

\begin{aligned} \text{Cov}(\lambda'\hat{\beta},\, d'Y - \lambda'\hat{\beta}) &= \text{Cov}\!\left(\lambda'(X'X)^{-1}X'Y,\; d'Y - \lambda'(X'X)^{-1}X'Y\right) \\ &= \lambda'(X'X)^{-1}X'\cdot \text{Var}(Y)\cdot\left(d - X(X'X)^{-1}\lambda\right) \\ &= \sigma^2\,\lambda'(X'X)^{-1}X'\left(d - X(X'X)^{-1}\lambda\right) \\ &= 0. \end{aligned}

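The last equality uses $X'd = \lambda$, which gives $\lambda'(X'X)^{-1}X'd = \lambda'(X'X)^{-1}\lambda = \lambda'(X'X)^{-1}X'X(X'X)^{-1}\lambda$. Hence,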

\text{Var}(c + d'Y) = \text{Var}(\lambda'\hat{\beta}) + \text{Var}(d'Y - \lambda'\hat{\beta}) \geq \text{Var}(\lambda'\hat{\beta})

with equality if and only if

\begin{aligned} 0 &= \text{Var}(d'Y - \lambda'\hat{\beta}) \\ &= \left(d' - \lambda'(X'X)^{-1}X'\right)\text{Var}(Y)\left(d' - \lambda'(X'X)^{-1}X'\right)' \\ &= \sigma^2\left\lVert d - X(X'X)^{-1}\lambda\right\rVert_2^2. \end{aligned}

That is, $\text{Var}(c + d'Y) = \text{Var}(\lambda'\hat{\beta})$ if and only if $d = X(X'X)^{-1}\lambda$, in which case

d'y = \lambda'(X'X)^{-1}X'y = \lambda'\hat{\beta},

so that $\lambda'\hat{\beta}$ is the unique BLUE of $\lambda'\beta$. $\square$
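As a small numerical illustration of the theorem, the following Python sketch compares two linear unbiased estimators of the slope in a simple linear regression. The design points and the "endpoint" estimator $(y_n - y_1)/(x_n - x_1)$ are hypothetical choices made only for this example. Since $\text{Var}(d'Y) = \sigma^2\lVert d\rVert_2^2$ under the Gauss-Markov assumptions, it suffices to compare $\lVert d\rVert_2^2$.

```python
# Compare two linear unbiased estimators d'Y of the slope in
# y_i = b0 + b1 * x_i + u_i. Under the Gauss-Markov assumptions,
# Var(d'Y) = sigma^2 * ||d||^2, so only ||d||^2 needs comparing.

def ols_slope_weights(x):
    """Weights d such that d'y is the least squares slope."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [(xi - xbar) / sxx for xi in x]


def endpoint_slope_weights(x):
    """Weights for the illustrative estimator (y_n - y_1) / (x_n - x_1)."""
    d = [0.0] * len(x)
    span = x[-1] - x[0]
    d[0], d[-1] = -1.0 / span, 1.0 / span
    return d


def is_unbiased_for_slope(d, x, tol=1e-12):
    """Check X'd = lambda for lambda = (0, 1)': sum(d) = 0, sum(d * x) = 1."""
    sum_d = sum(d)
    sum_dx = sum(di * xi for di, xi in zip(d, x))
    return abs(sum_d) < tol and abs(sum_dx - 1.0) < tol


def variance_factor(d):
    """Return ||d||^2, i.e. Var(d'Y) / sigma^2."""
    return sum(di ** 2 for di in d)


x = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical design points
d_ols = ols_slope_weights(x)
d_end = endpoint_slope_weights(x)
print(is_unbiased_for_slope(d_ols, x), is_unbiased_for_slope(d_end, x))
print(variance_factor(d_ols), variance_factor(d_end))
```

For this design both estimators satisfy $X'd = \lambda$, and the variance factors are about $0.1$ for least squares and $0.125$ for the endpoint estimator, in line with the theorem.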

Variance Estimation

Recall the unique decomposition,

\begin{aligned} Y &= \mathcal{P}_X Y + (I - \mathcal{P}_X)Y \\ &= X\hat{\beta} + (I - \mathcal{P}_X)(X\beta + U) \\ &= X\hat{\beta} + (I - \mathcal{P}_X)U \end{aligned}

for any $\hat{\beta} \in \{X'X\beta = X'Y\} = \{X\beta = \mathcal{P}_X Y\}$. Since the component $\mathcal{P}_X Y$ of the decomposition is related to the estimation of $\beta$, the intuition is that $(I - \mathcal{P}_X)Y$ is related to the estimation of $\sigma^2$.

Theorem. Under the Gauss-Markov assumptions, $\widehat{\sigma^2} := Y'(I - \mathcal{P}_X)Y / (n - r)$ is an unbiased estimator of $\sigma^2$, where $r := \text{rank}(X) \leq p$. $\square$

Proof.

\begin{aligned} E(\widehat{\sigma^2}) &= \frac{1}{n-r}\,E\!\left[(X\beta + U)'(I - \mathcal{P}_X)(X\beta + U)\right] \\ &= \frac{1}{n-r}\,E\!\left[U'(I - \mathcal{P}_X)U\right] \\ &= \frac{1}{n-r}\,E\!\left[\text{tr}\!\left(U'(I - \mathcal{P}_X)U\right)\right] \\ &= \frac{1}{n-r}\,\text{tr}\!\left[(I - \mathcal{P}_X)E(UU')\right] \\ &= \frac{1}{n-r}\,\text{tr}(I - \mathcal{P}_X)\cdot\sigma^2 \\ &= \sigma^2. \end{aligned}

$\square$

Note that we have not shown that $\sqrt{\widehat{\sigma^2}}$ is an unbiased estimator for $\sigma$.
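One way to see why unbiasedness need not carry over (a brief worked step added for illustration): the square root is strictly concave, so Jensen's inequality gives

E\!\left(\sqrt{\widehat{\sigma^2}}\right) \leq \sqrt{E(\widehat{\sigma^2})} = \sigma,

with strict inequality whenever $\widehat{\sigma^2}$ is not almost surely constant. In that case $\sqrt{\widehat{\sigma^2}}$ underestimates $\sigma$ on average.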

Implications on Model Selection

Here we make a distinction between the true data-generating model and the model being fit. In the context of a linear model, two discrepancies are:

Underfitting (missing covariates)
Overfitting (redundant covariates)

Misspecification from Underfitting

Consider the true model $Y = X\beta + \eta + U$, against the model used by the practitioner,

Y = X\beta + U.

Using the practitioner’s model, $Y = X\beta + U$, under the Gauss-Markov assumptions results in an omitted-variable bias. Consider the effect on the least squares estimator. For $\lambda \in \text{col}(X')$, where $\lambda = X'a$ for some $a \in \mathbb{R}^n$,

\begin{aligned} E(\lambda'\hat{\beta}) &= \lambda'(X'X)^{-1}X'E(Y) \\ &= \lambda'(X'X)^{-1}X'(X\beta + \eta) \\ &= \lambda'\beta + \lambda'(X'X)^{-1}X'\eta \\ &= \lambda'\beta + \underbrace{a'\mathcal{P}_X\eta}_{\text{misspecification bias}} \end{aligned}

Observe that if $\eta \perp \text{col}(X)$, then $\lambda'\hat{\beta}$ is unbiased. How about estimation of $\sigma^2$?

\begin{aligned} E\!\left(Y'(I - \mathcal{P}_X)Y\right) &= (X\beta+\eta)'(I - \mathcal{P}_X)(X\beta+\eta) + \sigma^2\,\text{tr}(I - \mathcal{P}_X) \\ &= \eta'(I - \mathcal{P}_X)\eta + \sigma^2(n-r) \end{aligned}

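That is, since $\eta'(I - \mathcal{P}_X)\eta = \lVert (I - \mathcal{P}_X)\eta \rVert_2^2 \geq 0$, the estimator $\widehat{\sigma^2}$ is unbiased if and only if $\eta \in \text{col}(X)$, and is biased upward otherwise. In the case that $\eta \in \text{col}(X)$, $\exists\, \delta$ such that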

Y = X\beta + \eta + U = X(\beta+\delta) + U.
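The two bias terms can be computed exactly in a small case. The Python sketch below assumes a hypothetical design with an intercept and one covariate, and an omitted term $\eta_i = x_i^2$. The bias of the slope is the slope obtained by regressing $\eta$ on $X$, and $\eta'(I - \mathcal{P}_X)\eta$ is the residual sum of squares from that same regression.

```python
# Omitted-variable bias in simple linear regression, computed exactly.
# The design x and the omitted term eta are illustrative assumptions.

def fit_line(x, y):
    """Least squares intercept and slope for y regressed on (1, x)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


x = [0.0, 1.0, 2.0, 3.0, 4.0]
eta = [xi ** 2 for xi in x]  # omitted mean component

# (X'X)^{-1} X' eta: the coefficients from regressing eta on (1, x).
intercept_bias, slope_bias = fit_line(x, eta)

# eta'(I - P_X) eta: residual sum of squares of that regression.
rss_eta = sum((e - intercept_bias - slope_bias * xi) ** 2
              for xi, e in zip(x, eta))

# Bias of the variance estimator: eta'(I - P_X) eta / (n - r), with r = 2.
sigma2_bias = rss_eta / (len(x) - 2)
print(intercept_bias, slope_bias, rss_eta, sigma2_bias)
```

Here the slope is biased by $4$, the intercept by $-2$, and $\widehat{\sigma^2}$ is biased upward by $14/3$, whatever the value of $\sigma^2$.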

Overfitting and Multicollinearity

Practitioner’s model: $Y = X\beta + W\gamma + U$, where $\gamma = 0$ in the true data-generating model. Assume that the concatenated matrix $(X, W)$ has full column rank. Then under the Gauss-Markov assumptions,

\begin{pmatrix} \hat{\beta} \\ \hat{\gamma} \end{pmatrix} = \begin{pmatrix} X'X & X'W \\ W'X & W'W \end{pmatrix}^{-1}\begin{pmatrix} X'Y \\ W'Y \end{pmatrix}, \quad \text{with} \quad E\begin{pmatrix} \hat{\beta} \\ \hat{\gamma} \end{pmatrix} = \begin{pmatrix} \beta \\ \gamma \end{pmatrix} = \begin{pmatrix} \beta \\ 0 \end{pmatrix}.

It can be worked out, from the inverse of the partitioned matrix above, that

\text{Var}(\hat{\beta}) = \sigma^2(X'X)^{-1} + \sigma^2(X'X)^{-1}X'W\left(W'(I-\mathcal{P}_X)W\right)^{-1}W'X(X'X)^{-1},

where the second term can be understood as the increase in variability of the least squares estimator that results from including redundant covariates. However, estimation of $\sigma^2$ remains unbiased:

E\!\left[Y'(I - \mathcal{P}_{(X,W)})Y\right] = \sigma^2\left(n - \text{rank}(X,W)\right).

Notice that if the columns of $W$ are not only linearly independent but are also orthogonal to the columns of $X$, then the variance penalty is zero. In this case, there is no cost to overfitting. More generally, though, if $W$ is not orthogonal to $X$, then a loss of efficiency from overfitting ensues.
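The variance penalty can be checked numerically in the simplest setting. The Python sketch below assumes a model without an intercept and with a single column $x$ and a single redundant column $w$ (all values hypothetical). It evaluates the formula for $\text{Var}(\hat{\beta})/\sigma^2$ and compares it with the top-left entry of the inverse of the $2 \times 2$ cross-product matrix.

```python
# Variance inflation from a redundant covariate, for single columns x and w.
# All numbers are illustrative; there is no intercept in this toy model.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def beta_variance_factor(x, w=None):
    """Var(beta_hat) / sigma^2 in y = x*beta [+ w*gamma] + u."""
    xx = dot(x, x)
    if w is None:
        return 1.0 / xx
    xw, ww = dot(x, w), dot(w, w)
    w_resid_ss = ww - xw ** 2 / xx          # W'(I - P_X)W
    penalty = (xw / xx) ** 2 / w_resid_ss   # second term of the formula
    return 1.0 / xx + penalty


def beta_variance_factor_direct(x, w):
    """Top-left entry of the inverse of [[x'x, x'w], [w'x, w'w]]."""
    xx, xw, ww = dot(x, x), dot(x, w), dot(w, w)
    return ww / (xx * ww - xw ** 2)


x = [1.0, 2.0, 3.0, 4.0]
w_corr = [1.0, 2.0, 3.0, 5.0]    # nearly collinear with x
w_orth = [2.0, -1.0, 0.0, 0.0]   # orthogonal to x

print(beta_variance_factor(x))           # no redundant covariate
print(beta_variance_factor(x, w_orth))   # orthogonal: no penalty
print(beta_variance_factor(x, w_corr))   # correlated: large penalty
print(beta_variance_factor_direct(x, w_corr))
```

With the orthogonal column the factor stays at $1/30$, while the nearly collinear column raises it to $39/14$.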

A measure to gauge the influence of multicollinearity relates to the SVD of $(X, W)$.

Definition. The mean squared error (MSE) of an estimator $\hat{\theta}$ for some parameter $\theta$ is

\text{MSE}(\hat{\theta}) := E\left[\lVert \hat{\theta} - \theta \rVert_2^2\right].

$\square$

For some unbiased estimator $\hat{\beta}$ of $\beta$,

\begin{aligned} \text{MSE}(\hat{\beta}) &= E\left[\lVert \hat{\beta} - \beta \rVert_2^2\right] \\ &= \text{tr}\!\left(E\left[(\hat{\beta}-\beta)(\hat{\beta}-\beta)'\right]\right) \\ &= \text{tr}\!\left(\text{Var}(\hat{\beta})\right). \end{aligned}

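Further, take $\hat{\beta}$ to be the least squares estimator with $X$ of full column rank, so that $\text{Var}(\hat{\beta}) = \sigma^2(X'X)^{-1}$ under the Gauss-Markov assumptions. Denoting the SVD of the design matrix as $X = U\Sigma V'$ (here $U$ and $\Sigma$ are the SVD factors, not the error vector or a covariance matrix),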

\begin{aligned} \text{MSE}(\hat{\beta}) &= \text{tr}\!\left(\text{Var}(\hat{\beta})\right) \\ &= \sigma^2\,\text{tr}\!\left[(X'X)^{-1}\right] \\ &= \sigma^2\,\text{tr}\!\left(V\Sigma^{-2}V'\right) \\ &= \sigma^2\sum_{i=1}^{p}\frac{1}{\lambda_i}, \end{aligned}

where $\lambda_i$ is the $i$th eigenvalue of $X'X$ (equivalently, the square of the $i$th singular value of $X$). In the case that multicollinearity is severe, $X'X$ is nearly singular. As such, $\min_i\{\lambda_i\}$ will be close to zero, and so $\text{MSE}(\hat{\beta})$ will be excessively large. Another measure of multicollinearity is the condition number of $X'X$,

\kappa := \frac{\max_i\{\lambda_i\}}{\min_i\{\lambda_i\}}.

Alternatively, there is a residual-sum-of-squares approach to assessing the severity of multicollinearity, called variance inflation factors.
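To make the eigenvalue-based measures concrete, the Python sketch below computes $\sum_i 1/\lambda_i$ and $\kappa$ for a design with two columns, using the closed-form eigenvalues of a symmetric $2 \times 2$ matrix. The two example designs are hypothetical.

```python
import math

# Eigenvalue-based multicollinearity measures for a two-column design.
# The example columns are illustrative assumptions.

def gram_entries(x1, x2):
    """Entries (a, b, c) of X'X = [[a, b], [b, c]] for X = (x1, x2)."""
    a = sum(v * v for v in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    c = sum(v * v for v in x2)
    return a, b, c


def eig_sym_2x2(a, b, c):
    """Eigenvalues of [[a, b], [b, c]], largest first."""
    mid = (a + c) / 2.0
    rad = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    return mid + rad, mid - rad


def mse_factor_and_kappa(x1, x2):
    """Return (sum_i 1/lambda_i, lambda_max / lambda_min)."""
    lam_max, lam_min = eig_sym_2x2(*gram_entries(x1, x2))
    return 1.0 / lam_max + 1.0 / lam_min, lam_max / lam_min


x1_orth, x2_orth = [1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]
x1_near, x2_near = [1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.1]

print(mse_factor_and_kappa(x1_orth, x2_orth))
print(mse_factor_and_kappa(x1_near, x2_near))
```

For the orthogonal design both measures equal $1$. For the nearly collinear design, $\text{MSE}(\hat{\beta})/\sigma^2$ is roughly $274$ and $\kappa$ is in the thousands.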

Distributional Theory

To this point we have considered the following assumptions on the general linear model, $Y = X\beta + U$:

$E(U) = 0$ in the context of the least squares problem
$E(U) = 0$ in the context of estimation
$E(U) = 0$ and $\text{Var}(U) = \sigma^2 I_n$
$E(U) = 0$ and $\text{Var}(U) = \sigma^2 V$, for some known positive-definite $V$.

In this section we consider the implications of a distributional assumption on $U$.

Definition. A random vector $Y \in \mathbb{R}^p$ is said to follow the multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$ if, $\forall\, v \in \mathbb{R}^p$ such that $v'\Sigma v \neq 0$,

v'Y \sim N(v'\mu,\, v'\Sigma v).

Denote $Y \sim N_p(\mu, \Sigma)$. $\square$

Theorem. A random vector $Y \sim N_p(\mu, \Sigma)$ for some nonsingular matrix $\Sigma$ if and only if $Y$ has density,

f(y) = \det(2\pi\Sigma)^{-1/2}\, e^{-\frac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)}.

$\square$

Proof. Exercise. $\square$

Note that if $Z \sim N_p(0, I_p)$, then $Y := \mu + \Sigma^{1/2}Z \sim N_p(\mu, \Sigma)$. Other than the density function, another quantity that uniquely determines the distribution of a random vector, when it exists, is the moment generating function.

Definition. The moment generating function of a random vector $X \in \mathbb{R}^p$ is

m_X(t) := E\!\left(e^{t'X}\right), \quad t \in \mathbb{R}^p

provided that the expectation exists in a neighborhood of $t = 0$. $\square$

Recall two important properties of moment generating functions:

(1) If the moment generating functions for two random variables $X$ and $Y$ exist, then they share identical CDFs if and only if $m_X(t) = m_Y(t)$ for every $t$ in some neighborhood of zero.
(2) Let $X := \begin{pmatrix} X_1 \\ \vdots \\ X_k \end{pmatrix}$ for some $k$. Then $X_1, \ldots, X_k$ are mutually independent if and only if $m_X(t) = m_{X_1}(t_1) \cdots m_{X_k}(t_k)$ for every $t$ in some neighborhood of zero.

Using property (2), if $Z_1, \ldots, Z_p \overset{\text{iid}}{\sim} N(0,1)$, then for $Z := (Z_1, \ldots, Z_p)'$,

\begin{aligned} m_Z(t) &= m_{Z_1}(t_1) \cdots m_{Z_p}(t_p) \\ &= e^{\frac{1}{2}t_1^2} \cdots e^{\frac{1}{2}t_p^2} \\ &= e^{\frac{1}{2}t't}. \end{aligned}

Then if $Y := \mu + \Sigma^{1/2}Z \sim N_p(\mu, \Sigma)$ for $Z \sim N_p(0, I_p)$,

m_Y(t) = E\!\left(e^{t'Y}\right) = e^{t'\mu}\,E\!\left(e^{t'\Sigma^{1/2}Z}\right) = e^{t'\mu}\,m_Z\!\left(\Sigma^{1/2}t\right) = e^{\,t'\mu\, +\, \frac{1}{2}t'\Sigma t}.

More generally, linear transformations of multivariate normal random variables can be summarized as in the following result.

Theorem. If $X \sim N_p(\mu, \Sigma)$ and $Y = a + BX$ for some $a \in \mathbb{R}^q$ and $B \in \mathbb{R}^{q \times p}$, then

Y \sim N_q(a + B\mu,\; B\Sigma B').

$\square$

Proof. Generalize the above argument. $\square$
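A sketch of how the generalization might go, assuming as derived above that $m_X(s) = e^{s'\mu + \frac{1}{2}s'\Sigma s}$ for $s \in \mathbb{R}^p$: for $t \in \mathbb{R}^q$,

\begin{aligned} m_Y(t) &= E\!\left(e^{t'(a + BX)}\right) \\ &= e^{t'a}\,m_X(B't) \\ &= e^{\,t'(a + B\mu)\, +\, \frac{1}{2}t'B\Sigma B't}, \end{aligned}

which has the form of the moment generating function of $N_q(a + B\mu,\, B\Sigma B')$. The uniqueness property of moment generating functions then identifies the distribution of $Y$.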

Next, we will study quadratic forms of multivariate normal random variables.

Definition. Let $Z \sim N_p(0, I_p)$. Then $U := Z'Z = \sum_{i=1}^{p} Z_i^2$ is said to follow the $\chi^2_p$ distribution. The parameter $p$ denotes the degrees of freedom. $\square$

Some properties of the $\chi^2_p$ distribution:

\begin{aligned} m_U(t) &= (1-2t)^{-p/2}, \quad t < \tfrac{1}{2} \\ E(U) &= p \\ \text{Var}(U) &= 2p \end{aligned}

Observe that the variance grows on the same order as the mean.

Definition. Let $J \sim \text{Poisson}(\Phi)$ and $U \mid J = j \sim \chi^2_{p+2j}$. Then (unconditionally) $U$ is said to follow a noncentral chi-squared distribution with noncentrality parameter $\Phi$, denoted $U \sim \chi^2_p(\Phi)$. $\square$

We will see shortly that if $Z_i \sim N(\mu_i, 1)$ independently, then $U := \sum_{i=1}^{p} Z_i^2 \sim \chi^2_p(\Phi)$, where

\Phi = \frac{1}{2}\sum_{i=1}^{p} \mu_i^2.

The density for $U$ can be derived as

\begin{aligned} f_U(u) &= \sum_{j \geq 0} f_{U \mid J}(u \mid j)\cdot f_J(j) \\ &= \sum_{j \geq 0} \frac{u^{\frac{p+2j-2}{2}}\,e^{-u/2}}{\Gamma\!\left(\frac{p+2j}{2}\right) 2^{\,j+p/2}} \cdot \frac{e^{-\Phi}\Phi^{j}}{j!}, \end{aligned}

and the moment generating function, mean, and variance are

\begin{aligned} m_U(t) &= (1-2t)^{-p/2}\cdot e^{\frac{2\Phi t}{1-2t}}, \quad t < \tfrac{1}{2} \\ E(U) &= p + 2\Phi \\ \text{Var}(U) &= 2p + 8\Phi \end{aligned}
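The mean and variance formulas can be checked by simulation, using the sum-of-squares representation $U = \sum_i Z_i^2$ with independent $Z_i \sim N(\mu_i, 1)$ stated above. The Python sketch below is a Monte Carlo check under hypothetical means $\mu_i$; the seed and the number of draws are arbitrary choices.

```python
import random

# Monte Carlo check of E(U) = p + 2*Phi and Var(U) = 2p + 8*Phi for
# U = sum_i Z_i^2 with independent Z_i ~ N(mu_i, 1). The means mu,
# the seed and the number of draws are illustrative assumptions.

def simulate_sum_of_squares(mu, n_draws, seed=0):
    """Draw U = sum_i Z_i^2 with independent Z_i ~ N(mu_i, 1)."""
    rng = random.Random(seed)
    return [sum(rng.gauss(m, 1.0) ** 2 for m in mu) for _ in range(n_draws)]


def mean_and_variance(values):
    """Sample mean and (n - 1)-denominator sample variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var


mu = [1.0, 2.0, 0.0]
p = len(mu)
phi = 0.5 * sum(m * m for m in mu)  # noncentrality parameter, here 2.5

draws = simulate_sum_of_squares(mu, 50_000)
sim_mean, sim_var = mean_and_variance(draws)
print(sim_mean, p + 2 * phi)      # simulated vs. theoretical mean (8)
print(sim_var, 2 * p + 8 * phi)   # simulated vs. theoretical variance (26)
```

The simulated moments should land close to the theoretical values $8$ and $26$, up to Monte Carlo error.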

Theorem. If $U_1, \ldots, U_m$ are mutually independent and $U_i \sim \chi^2_{p_i}(\Phi_i)$, then

U := \sum_{i=1}^{m} U_i \sim \chi^2_p(\Phi),

where $p := \sum_{i=1}^{m} p_i$ and $\Phi := \sum_{i=1}^{m} \Phi_i$. $\square$

Proof.

\begin{aligned} m_U(t) &= E\!\left(e^{tU}\right) \\ &= E\!\left(e^{t\sum_{i=1}^{m} U_i}\right) \\ &= \prod_{i=1}^{m} E\!\left(e^{tU_i}\right) \\ &= (1-2t)^{-\frac{1}{2}\sum_{i=1}^{m} p_i}\, e^{\frac{2t\sum_{i=1}^{m}\Phi_i}{1-2t}}, \quad t < \tfrac{1}{2} \end{aligned}

$\square$

Theorem. If $X \sim N(\mu, 1)$, then $X^2 \sim \chi^2_1\!\left(\tfrac{1}{2}\mu^2\right)$. $\square$

Proof.

\begin{aligned} m_{X^2}(t) &= E\!\left(e^{tX^2}\right) \\ &= \int_{\mathbb{R}} e^{tx^2}\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-\mu)^2}\,dx \end{aligned}

Then “complete the square” to show $m_{X^2}(t) = (1-2t)^{-1/2}\,e^{\frac{2t(\mu^2/2)}{1-2t}}$. $\square$
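The completing-the-square step can be sketched as follows, for $t < \tfrac{1}{2}$. The exponent in the integrand is

\begin{aligned} tx^2 - \tfrac{1}{2}(x-\mu)^2 &= -\tfrac{1}{2}(1-2t)x^2 + \mu x - \tfrac{1}{2}\mu^2 \\ &= -\tfrac{1}{2}(1-2t)\left(x - \frac{\mu}{1-2t}\right)^2 + \frac{\mu^2}{2(1-2t)} - \frac{\mu^2}{2} \\ &= -\tfrac{1}{2}(1-2t)\left(x - \frac{\mu}{1-2t}\right)^2 + \frac{\mu^2 t}{1-2t}. \end{aligned}

The first term is the kernel of a normal density with variance $(1-2t)^{-1}$, so the remaining integral contributes a factor of $(1-2t)^{-1/2}$, and

m_{X^2}(t) = (1-2t)^{-1/2}\,e^{\frac{\mu^2 t}{1-2t}} = (1-2t)^{-1/2}\,e^{\frac{2t(\mu^2/2)}{1-2t}}.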

Theorem. If $X \sim N_p(\mu, I_p)$, then $X'X \sim \chi^2_p\!\left(\tfrac{1}{2}\mu'\mu\right)$. $\square$

Proof. Since $\text{Cov}(X_i, X_j) = 0$ for every $i \neq j$, joint normality gives independence. Thus, by the previous theorem,

X_i^2 \sim \chi^2_1\!\left(\tfrac{1}{2}\mu_i^2\right), \quad \forall\, i \in \{1, \ldots, p\},

and so, by the two previous theorems, since $X_1^2, \ldots, X_p^2$ are independent,

X'X = \sum_{i=1}^{p} X_i^2 \sim \chi^2_p\!\left(\tfrac{1}{2}\mu'\mu\right).

$\square$

Corollary. If $X \sim N_p(\mu, V)$ for some nonsingular $V$, then $X'V^{-1}X \sim \chi^2_p\!\left(\tfrac{1}{2}\mu'V^{-1}\mu\right)$. $\square$

Proof. Rescale $X$ and apply the previous theorem. $\square$

Definition. Let $U_1$ and $U_2$ be independent random variables with $U_1 \sim \chi^2_{p_1}$ and $U_2 \sim \chi^2_{p_2}$. Then the $F$ distribution is defined as $F_{p_1,p_2} := \dfrac{U_1/p_1}{U_2/p_2}$. $\square$

Definition. Let $U_1$ and $U_2$ be independent random variables with $U_1 \sim \chi^2_{p_1}(\Phi)$ and $U_2 \sim \chi^2_{p_2}$. Then the noncentral $F$ distribution is defined as $F_{p_1,p_2}(\Phi) := \dfrac{U_1/p_1}{U_2/p_2}$. $\square$

Definition. Let $X \sim N(\mu, 1)$ and $U \sim \chi^2_p$, with $X$ and $U$ independent. Then the noncentral $T$ distribution is defined as $T_p(\mu) := \dfrac{X}{\sqrt{U/p}}$. The case $\mu = 0$ gives the (central) $T$ distribution, $T_p$. $\square$

Now we are ready to consider the Gaussian linear model,

Y = X\beta + U,

where $U \sim N(0, \sigma^2 I_n)$. The goal for the remainder of this chapter is to study the distributions of the components of

Y = \mathcal{P}_X Y + (I - \mathcal{P}_X)Y.

The squared norms of these components, $Y'\mathcal{P}_X Y$ and $Y'(I - \mathcal{P}_X)Y$, are quadratic forms of a multivariate normal random vector in projection matrices.

Quadratic Forms in the Gaussian Linear Model

Lemma. A $p \times p$ matrix $A$ is symmetric and idempotent with rank $s$ if and only if there exists a $p \times s$ matrix $G$ with orthonormal columns (i.e., $G'G = I_s$) such that $GG' = A$. $\square$

Proof. Exercise. $\square$

Theorem. Let $Y \sim N_p(\mu, I_p)$. If $A$ is a symmetric and idempotent matrix with $\text{rank}(A) = s$, then

Y'AY \sim \chi^2_s\!\left(\tfrac{1}{2}\mu'A\mu\right).

$\square$

Proof. By the previous lemma, there exists a $p \times s$ matrix $G$ with $G'G = I_s$ such that $GG' = A$. Then $G'Y \sim N_s(G'\mu, I_s)$, so that, by a previous theorem,

Y'AY = Y'GG'Y = (G'Y)'(G'Y) \sim \chi^2_s\!\left(\tfrac{1}{2}\mu'GG'\mu\right) = \chi^2_s\!\left(\tfrac{1}{2}\mu'A\mu\right).

$\square$

Note that for the Gaussian linear model, $Y \sim N(X\beta, \sigma^2 I_n)$, so $\sigma^{-1}Y \sim N(\sigma^{-1}X\beta, I_n)$. Since $I - \mathcal{P}_X$ is symmetric and idempotent with rank $n - r$,

\sigma^{-2}\,Y'(I - \mathcal{P}_X)Y = (\sigma^{-1}Y)'(I - \mathcal{P}_X)(\sigma^{-1}Y) \sim \chi^2_{n-r},

with noncentrality parameter zero, since $(I - \mathcal{P}_X)X\beta = 0$. Similarly, since $\mathcal{P}_X$ is symmetric and idempotent with rank $r$,

\sigma^{-2}\,Y'\mathcal{P}_X Y \sim \chi^2_r\!\left(\frac{\beta'X'X\beta}{2\sigma^2}\right).

Moreover,

\begin{pmatrix} \hat{Y} \\ \hat{U} \end{pmatrix} \sim N\!\left(\begin{pmatrix} X\beta \\ 0 \end{pmatrix},\; \sigma^2\begin{pmatrix} \mathcal{P}_X & 0 \\ 0 & I - \mathcal{P}_X \end{pmatrix}\right).

Since $\hat{Y}$ and $\hat{U}$ are jointly normal and have zero covariance, they are independent, and so

\frac{Y'\mathcal{P}_X Y / r}{Y'(I - \mathcal{P}_X)Y/(n-r)} \sim F_{r,\,n-r}\!\left(\frac{\beta'X'X\beta}{2\sigma^2}\right).

Theorem. (Cochran’s theorem) Let $Y \sim N(\mu, \sigma^2 I_n)$, and let $A_1, \ldots, A_k$ be symmetric idempotent matrices with $\text{rank}(A_i) = s_i$ for $i \in \{1, \ldots, k\}$. If $\sum_{i=1}^{k} A_i = I_n$, then

\sigma^{-2}\,Y'A_iY \sim \chi^2_{s_i}\!\left(\tfrac{1}{2\sigma^2}\mu'A_i\mu\right), \quad i = 1, \ldots, k,

where $n = \sum_{i=1}^{k} s_i$, and $\sigma^{-2}Y'A_1Y, \ldots, \sigma^{-2}Y'A_kY$ are independent. $\square$

Proof. See a linear algebra book. $\square$

Consider again the Gaussian linear model, $Y \sim N(X\beta, \sigma^2 I_n)$. Up to this point, we have shown that the BLUE for an estimable $\lambda'\beta$ is $\lambda'\hat{\beta}$, for $\hat{\beta} \in \{X'X\beta = X'y\}$, and that

\lambda'\hat{\beta} \sim N\!\left(\lambda'\beta,\; \sigma^2\lambda'(X'X)^{-1}\lambda\right),

independent of

\widehat{\sigma^2} := \frac{Y'(I - \mathcal{P}_X)Y}{n - r},

where $\dfrac{(n-r)\,\widehat{\sigma^2}}{\sigma^2} \sim \chi^2_{n-r}$.

Theorem. In the model $Y \sim N(X\beta, \sigma^2 I_n)$, with unknown parameters $\beta$ and $\sigma^2$, the maximum likelihood estimators are $\hat{\beta} = (X'X)^{-1}X'y$ and $\widehat{\sigma^2} = \dfrac{1}{n}Y'(I - \mathcal{P}_X)Y$, respectively. $\square$

Proof. The log-likelihood of the data is given by

\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{Q(\beta)}{2\sigma^2},

where $Q(\beta) := (y - X\beta)'(y - X\beta)$. Maximizing with respect to $\beta$ is equivalent to minimizing $Q(\beta)$ with respect to $\beta$. Accordingly, the MLE of $\beta$ is

\hat{\beta} := \operatorname*{argmin}_{\beta}\; Q(\beta) = (X'X)^{-1}X'y.

Next,

\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{Q(\hat{\beta})}{2\sigma^4} = 0

gives

\widehat{\sigma^2} = \frac{Q(\hat{\beta})}{n} = \frac{Y'(I - \mathcal{P}_X)Y}{n}.

(Left to verify the second-order conditions.) $\square$
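A brief consequence, sketched here by combining this result with the earlier unbiasedness theorem (with $r = \text{rank}(X) \geq 1$): the maximum likelihood estimator of $\sigma^2$ is biased downward in finite samples,

E\!\left(\frac{Y'(I - \mathcal{P}_X)Y}{n}\right) = \frac{n-r}{n}\,\sigma^2 < \sigma^2,

although the bias vanishes as $n \to \infty$ for fixed $r$.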

Bootstrapping

The basic idea of bootstrapping techniques is to resample, with replacement, from an observed data set to approximate the sampling distribution of a statistic. The motivation for why this is a reasonable idea comes from Glivenko–Cantelli theory, which establishes that, for i.i.d. data, the empirical distribution function (EDF) converges uniformly to the CDF almost surely as $n \to \infty$.

Definition. The EDF is defined, for a sample $x_1, \ldots, x_n$, as

\widehat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{x_i \le x\},

equivalently, as the step function

\widehat{F}_n(x) = \begin{cases} 0, & x < x_{(1)} \\ \dfrac{1}{n}, & x_{(1)} \le x < x_{(2)} \\ \dfrac{2}{n}, & x_{(2)} \le x < x_{(3)} \\ \;\;\vdots & \\ 1, & x \ge x_{(n)} \end{cases}

where $x_{(1)}, \ldots, x_{(n)}$ are the order statistics of the sample. The corresponding empirical density function places a point mass of $\tfrac{1}{n}$ at each observed value,

\widehat{f}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \delta(x - x_i).

$\square$
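A minimal Python sketch of the EDF as defined above. The helper name `make_edf` and the sample values are hypothetical. Sorting once and using a binary search counts the observations with $x_i \le x$.

```python
import bisect

# Empirical distribution function: F_n(x) = (1/n) * #{i : x_i <= x}.
# The helper name and the sample are illustrative.

def make_edf(sample):
    """Return a function evaluating the EDF of the given sample."""
    xs = sorted(sample)
    n = len(xs)

    def edf(x):
        # bisect_right returns the number of sorted values <= x.
        return bisect.bisect_right(xs, x) / n

    return edf


F = make_edf([3.0, 1.0, 2.0, 2.0])
print(F(0.5), F(1.0), F(2.0), F(2.5), F(3.0))
```

For this sample the printed values are `0.0 0.25 0.75 0.75 1.0`; the jump of size $2/n$ at $x = 2$ reflects the tied observations.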

Definition. A bootstrapped sample of size $n$ is defined as

X_1^*, \ldots, X_n^* \overset{\text{iid}}{\sim} \widehat{F}_n.

Note that bootstrapped sampling is sampling with replacement. $\square$

Consider, in R, an example of bootstrapping to approximate the sampling distribution of the sample mean of both Gaussian and Cauchy data. In both cases, compare to confidence intervals from $t$-distributed critical values.
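The notes call for an R illustration; as a stand-in, here is a minimal Python sketch of the same comparison. The sample size, seeds, number of bootstrap replicates, and the rounded critical value `T_CRIT` (approximately the 97.5% quantile of the $t$ distribution with 49 degrees of freedom) are all assumptions of this example. Cauchy draws are generated by the inverse-CDF method.

```python
import math
import random

# Bootstrap the sample mean for Gaussian and Cauchy data and compare a
# percentile interval with a t-based interval. All settings are illustrative.

def bootstrap_means(sample, n_boot, seed=0):
    """Means of n_boot resamples drawn with replacement from the sample."""
    rng = random.Random(seed)
    n = len(sample)
    return [sum(rng.choices(sample, k=n)) / n for _ in range(n_boot)]


def percentile_interval(values, level=0.95):
    """Interval between the empirical (1-level)/2 and (1+level)/2 quantiles."""
    s = sorted(values)
    alpha = (1.0 - level) / 2.0
    lo = s[int(math.floor(alpha * (len(s) - 1)))]
    hi = s[int(math.ceil((1.0 - alpha) * (len(s) - 1)))]
    return lo, hi


def t_interval(sample, t_crit):
    """Classical interval: mean +/- t_crit * s / sqrt(n)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
    half = t_crit * sd / math.sqrt(n)
    return mean - half, mean + half


rng = random.Random(1)
n = 50
gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
cauchy = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
T_CRIT = 2.01  # approximate t quantile for 49 degrees of freedom

for name, data in [("gaussian", gaussian), ("cauchy", cauchy)]:
    boot = bootstrap_means(data, 2000)
    print(name, percentile_interval(boot), t_interval(data, T_CRIT))
```

For the Gaussian sample the two intervals should be similar. For the Cauchy sample the mean does not exist, so neither interval has its usual interpretation, and both can be dominated by a few extreme observations.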

Next, consider how to bootstrap the sampling distributions of the coefficients in the simple linear regression model,

y_i = \beta_0 + \beta_1 x_i + u_i \qquad \text{for } i \in \{1, \ldots, n\}.

Because each $y_i$ is paired with its $x_i$, the responses cannot be resampled on their own; the stochastic component to be resampled is the error $u_i$. There are two natural approaches for bootstrapping the errors.

Approach 1

Since the $u_1, \ldots, u_n$ are unobservable, we could instead think of the pairs

(x_i, y_i) \overset{\text{iid}}{\sim} f_{X,Y},

for some joint distribution $f_{X,Y}$. As such, draw bootstrap samples as

(x_i^*, y_i^*) \overset{\text{iid}}{\sim} \widehat{f}_{X,Y}, \qquad i = 1, \ldots, n.

Remember that this is still sampling with replacement. Next, compute the bootstrapped coefficient estimates as

\hat{\beta}_1^{*} = \frac{\sum_{i=1}^{n}(x_i^{*} - \bar{x}^{*})(y_i^{*} - \bar{y}^{*})}{\sum_{i=1}^{n}(x_i^{*} - \bar{x}^{*})^2} \qquad \text{and} \qquad \hat{\beta}_0^{*} = \bar{y}^{*} - \hat{\beta}_1^{*}\bar{x}^{*}.

Repeat this procedure $N$ times, for some large $N$, to approximate the sampling distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$.

(Illustrate in R.)
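In place of the R illustration, a minimal Python sketch of Approach 1. The data are simulated from a hypothetical model $y_i = 1 + 2x_i + u_i$, and the function names, seed, and $N = 1000$ are assumptions of this example. Resamples in which all $x_i^*$ coincide are redrawn, since the slope is then undefined.

```python
import random

# Pairs bootstrap for simple linear regression (Approach 1).
# Data, seeds and the number of replicates are illustrative.

def ols(x, y):
    """Least squares (intercept, slope) for y regressed on (1, x)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def pairs_bootstrap(x, y, n_boot, seed=0):
    """Resample (x_i, y_i) pairs with replacement and refit each time."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:  # slope undefined if all x* are equal; redraw
            continue
        estimates.append(ols(xs, [y[i] for i in idx]))
    return estimates


rng = random.Random(42)
x = [float(i) for i in range(20)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]  # simulated data

boot = pairs_bootstrap(x, y, 1000)
slopes = sorted(b1 for _, b1 in boot)
print(ols(x, y))
print(slopes[25], slopes[974])  # rough 2.5% and 97.5% bootstrap quantiles
```

The spread of `slopes` approximates the sampling variability of $\hat{\beta}_1$ without any assumption on the error distribution beyond i.i.d. pairs.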

Approach 2

Since the $u_1, \ldots, u_n$ are unobservable, we could instead estimate the errors as

\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i, \qquad i = 1, \ldots, n.

Then bootstrap the $n$ response values as

y_i^{*} = \hat{\beta}_0 + \hat{\beta}_1 x_i + u_i^{*}, \qquad u_i^{*} \overset{\text{iid}}{\sim} \widehat{F}_{\hat{u}}, \qquad i = 1, \ldots, n.

Then compute the bootstrapped coefficient estimates $\hat{\beta}_0^{*}$ and $\hat{\beta}_1^{*}$ as above. Repeat this procedure $N$ times, for some large $N$, to approximate the sampling distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$.

(Illustrate in R.)
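In place of the R illustration, a minimal Python sketch of Approach 2. As before, the simulated data, function names, seed, and $N = 1000$ are assumptions of this example. The design points $x_i$ are held fixed and only the residuals are resampled.

```python
import random

# Residual bootstrap for simple linear regression (Approach 2).
# Data, seeds and the number of replicates are illustrative.

def ols(x, y):
    """Least squares (intercept, slope) for y regressed on (1, x)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def residual_bootstrap(x, y, n_boot, seed=0):
    """Resample raw residuals with replacement, rebuild y*, and refit."""
    rng = random.Random(seed)
    b0, b1 = ols(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    estimates = []
    for _ in range(n_boot):
        u_star = rng.choices(resid, k=len(x))
        y_star = [fi + ui for fi, ui in zip(fitted, u_star)]
        estimates.append(ols(x, y_star))
    return estimates


rng = random.Random(42)
x = [float(i) for i in range(20)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]  # simulated data

boot = residual_bootstrap(x, y, 1000)
slopes = sorted(b1 for _, b1 in boot)
print(ols(x, y))
print(slopes[25], slopes[974])  # rough 2.5% and 97.5% bootstrap quantiles
```

Unlike the pairs bootstrap, this version conditions on the observed design, which matches the fixed-$X$ view of the linear model used throughout these notes.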

Observe that there is a subtle problem with sampling $u_i^{*} \sim \text{Uniform}\{\hat{u}_1, \ldots, \hat{u}_n\}$, in that

\hat{u} = (I - \mathcal{P}_X)U,

so that $E[\hat{u}] = 0$, but

\text{Var}(\hat{u}) = \sigma^2(I - \mathcal{P}_X).

Accordingly, $E[\hat{u}_i] = 0$ and $\text{Var}(\hat{u}_i) = \sigma^2(1 - h_i)$, where $h_i := [\mathcal{P}_X]_{ii}$. This is problematic because $u_1, \ldots, u_n$ are assumed to be i.i.d. with mean zero and a common variance, a condition the raw residuals $\hat{u}_i$ do not satisfy, since their variance depends on $h_i$. That being so, a correction is achieved by bootstrapping the $n$ response values as

y_i^{*} = \hat{\beta}_0 + \hat{\beta}_1 x_i + u_i^{*}, \qquad u_i^{*} \sim \text{Uniform}\{\tilde{u}_1, \ldots, \tilde{u}_n\}, \qquad i = 1, \ldots, n,

where

\tilde{u}_i := \frac{\hat{u}_i}{\sqrt{1 - h_i}}, \qquad i = 1, \ldots, n, \qquad h_i := [\mathcal{P}_X]_{ii}.

Then repeat the above procedure.

(Illustrate in R.)
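In place of the R illustration, a minimal Python sketch of the leverage-adjusted version. For the simple linear regression design $X = (1, x)$, the leverages have the closed form $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The simulated data, function names, seed, and $N = 1000$ are again assumptions of this example.

```python
import math
import random

# Residual bootstrap with leverage-rescaled residuals u_tilde_i.
# Data, seeds and the number of replicates are illustrative.

def ols(x, y):
    """Least squares (intercept, slope) for y regressed on (1, x)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def leverages(x):
    """h_i = [P_X]_ii for X = (1, x)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]


def rescaled_residuals(x, y):
    """Return raw residuals u_hat and rescaled u_tilde = u_hat / sqrt(1 - h)."""
    b0, b1 = ols(x, y)
    h = leverages(x)
    u_hat = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    u_tilde = [ui / math.sqrt(1.0 - hi) for ui, hi in zip(u_hat, h)]
    return u_hat, u_tilde


def adjusted_residual_bootstrap(x, y, n_boot, seed=0):
    """Resample the rescaled residuals, rebuild y*, and refit."""
    rng = random.Random(seed)
    b0, b1 = ols(x, y)
    _, u_tilde = rescaled_residuals(x, y)
    estimates = []
    for _ in range(n_boot):
        u_star = rng.choices(u_tilde, k=len(x))
        y_star = [b0 + b1 * xi + ui for xi, ui in zip(x, u_star)]
        estimates.append(ols(x, y_star))
    return estimates


rng = random.Random(42)
x = [float(i) for i in range(20)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]  # simulated data

h = leverages(x)
u_hat, u_tilde = rescaled_residuals(x, y)
boot = adjusted_residual_bootstrap(x, y, 1000)
print(sum(h))  # trace of P_X, which equals rank(X) = 2
print(max(abs(t) - abs(u) for t, u in zip(u_tilde, u_hat)))
```

The leverages sum to $\text{rank}(X) = 2$, and each rescaled residual is at least as large in magnitude as the raw residual, which offsets the shrinkage factor $1 - h_i$ in $\text{Var}(\hat{u}_i)$.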