Linear algebra is the study of vector spaces, linear transformations, matrices, and inner product spaces.

Definition. A vector space \(V\) over a field \(F\) is a set of elements \(v \in V\), with the operations of addition and scalar multiplication, that satisfy the following axioms.

  1. \(\forall\, x, y \in V,\quad x + y = y + x\)
  2. \(\forall\, x, y, z \in V,\quad (x + y) + z = x + (y + z)\)
  3. There exists an element “\(0\)” \(\in V\) s.t. \(x + 0 = x\), \(\forall\, x \in V\)
  4. \(\forall\, x \in V\; \exists\, y \in V\) s.t. \(x + y = 0\)
  5. There exists an element “\(1\)” \(\in F\) s.t. \(1 \cdot x = x\), \(\forall\, x \in V\)
  6. \(\forall\, a, b \in F,\; \forall\, x \in V,\quad (ab)x = a(bx)\)
  7. \(\forall\, a \in F,\; \forall\, x, y \in V,\quad a(x + y) = ax + ay\)
  8. \(\forall\, a, b \in F,\; \forall\, x \in V,\quad (a + b)x = ax + bx\) \(\square\)

In statistics contexts, typically \(V = \mathbb{R}^n\) and \(F = \mathbb{R}\).

The notion of a vector space is truly foundational for data science. A vector space gives us an entire framework and intuition for thinking about data objects (even as data are stored as objects in a computer).

Definition. A collection of vectors \(x_1, \ldots, x_p\) are said to be linearly dependent if and only if there exist scalars \(c_1, \ldots, c_p\), not all zero, such that

\[ \sum_{i=1}^{p} c_i x_i = 0. \]

A set of vectors that are *not* linearly dependent are said to be linearly independent. \(\square\)

Often, to show that a set of vectors are linearly independent, a good strategy is to demonstrate that \(c_1 x_1 + \cdots + c_p x_p = 0\) implies that \(c_1 = \cdots = c_p = 0\).

Note that if \(x_1, \ldots, x_p \in \mathbb{R}^n\), then there can be at most \(\min\{n, p\}\) linearly independent of the \(p\) vectors.

Definition. A basis, \(B\), for a vector space, \(V\), is a linearly independent subset of \(V\) with \(\text{span}(B) = V.\) \(\square\)

Example. For \(V = \mathbb{R}^n\), the set \(\{e_1, \ldots, e_n\}\), where \(e_i \in \mathbb{R}^n\) has \(i\)th component with value 1 and 0 for all other components, is a basis. This is commonly called the standard basis, and \(e_i\) a standard basis vector. \(\square\)

Theorem. Let \(V\) be a vector space and \(B := \{u_1, \ldots, u_n\} \subseteq V\). Then \(B\) is a basis for \(V\) if and only if every \(v \in V\) can be expressed uniquely as

\[ v = a_1 u_1 + \cdots + a_n u_n \]

for some scalars \(a_1, \ldots, a_n\). \(\square\)

Proof. Left as an exercise. \(\square\)

Theorem. If a vector space \(V\) is generated by some finite set \(S\), then some subset of \(S\) is a basis for \(V\). \(\square\)

Proof. Left as an exercise. \(\square\)

Definition. Let \(V\) and \(W\) be vector spaces over field \(\mathbb{R}\). We call a function \(T : V \to W\) a linear transformation from \(V\) to \(W\) if \(\forall\, x, y \in V\) and \(\forall\, c \in \mathbb{R}\),

  1. \(T(x + y) = T(x) + T(y)\)
  2. \(T(cx) = cT(x).\) \(\square\)

Example. Let \(A \in \mathbb{R}^{p \times q}\) and define \(T(x) := Ax\) \(\forall\, x \in \mathbb{R}^q\). Then \(T : \mathbb{R}^q \to \mathbb{R}^p\) is a linear transformation since

  1. \(\forall\, x, y \in \mathbb{R}^q,\quad T(x + y) = A(x + y) = Ax + Ay = T(x) + T(y)\), and
  2. \(\forall\, x \in \mathbb{R}^q,\; \forall\, c \in \mathbb{R},\quad T(cx) = A \cdot (cx) = cAx = cT(x).\) \(\square\)

Example. Let \(V = C(\mathbb{R})\), the vector space of continuous real-valued functions defined on \(\mathbb{R}\). Then \(\forall\, a, b \in \mathbb{R}\) with \(a < b\), the transformation \(T : V \to \mathbb{R}\) defined by,

\[ T(f) := \int_a^b f(x)\, dx \]

is linear. \(\square\)

Definition. Let \(V\) and \(W\) be vector spaces, and \(T : V \to W\) a linear transformation. The set,

\[ \text{null}(T) := \{x \in V : T(x) = 0\}, \]

is called the null space of \(T\). The range of \(T\) is defined as,

\[ \text{range}(T) := \{y \in W : T(x) = y \text{ for some } x \in V\}. \]

Further, \(\text{rank}(T) := \dim(\text{range}(T))\). \(\square\)

Theorem. (dimension theorem) Let \(V\) and \(W\) be vector spaces, and \(T : V \to W\) be linear. If \(V\) is finite-dimensional, then

\[ \dim(\text{null}(T)) + \text{rank}(T) = \dim(V). \]
\(\square\)

Proof. See any linear algebra textbook. \(\square\)

Recall that if \(T(x) := Ax\), then \(\text{range}(T)\) is called the column space of \(A\), denoted \(\text{col}(A)\). Similarly for \(\text{range}(T')\), called the row space of \(A\), and denoted by \(\text{row}(A)\).

Definition. Let \(A \in \mathbb{R}^{p \times q}\) and \(B \in \mathbb{R}^{q \times m}\). Then the matrix product, denoted by \(AB\), has components defined as

\[ (AB)_{ij} := \sum_{k=1}^{q} A_{ik} B_{kj} \]

for \(i \in \{1, \ldots, p\}\) and \(j \in \{1, \ldots, m\}\). \(\square\)

Further notions for matrices.

  1. The transpose of \(A\), denoted \(A'\), satisfies the property that \((A')_{ij} = A_{ji}\) for all indices \(i, j\).
  2. \(A \in \mathbb{R}^{p \times q}\) is called square if \(p = q\).
  3. A square matrix is called symmetric if \(A' = A\).
  4. \(A \in \mathbb{R}^{p \times p}\) is said to be invertible if \(\exists\, B \in \mathbb{R}^{p \times p}\) s.t. \(AB = I_p = BA\). In that case, \(A^{-1} := B\).
  5. \(A \in \mathbb{R}^{p \times p}\), the trace of \(A\) is defined as \(\displaystyle\text{tr}(A) := \sum_{i=1}^{p} A_{ii}\).

Selected properties of matrices.

  1. \((AB)' = B'A'\)
  2. \(A \in \mathbb{R}^{p \times p}\) is invertible if and only if \(\text{rank}(A) = p\) (equivalently \(\det(A) \neq 0\)).
  3. If both \(A, B \in \mathbb{R}^{p \times p}\) are invertible, then \((AB)^{-1} = B^{-1}A^{-1}\).
  4. For \(A\) and \(B\) of appropriate dimensions, \(\text{tr}(AB) = \text{tr}(BA)\).

Definition. Let \(A \in \mathbb{R}^{p \times p}\). A nonzero vector \(v \in \mathbb{R}^p\) is called an eigenvector of \(A\) if \(Av = \lambda v\) for some scalar \(\lambda\), called an eigenvalue. \(\square\)

Theorem. A scalar \(\lambda\) is an eigenvalue of a matrix \(A \in \mathbb{R}^{p \times p}\) if and only if \(\det(A - \lambda I_p) = 0\). The polynomial \(f(t) := \det(A - tI_p)\) is called the characteristic polynomial of \(A\). \(\square\)

Proof. See any linear algebra textbook. \(\square\)

For example, if

\[ A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \]

then \(\forall\, v \in \mathbb{R}^3\), \(Av = 1 \cdot v\). How about

\[ A = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 0 & 0 & 6 \end{pmatrix}? \]
\[ f(t) = \det(A - tI_3) \] \[ = \det \begin{pmatrix} 1-t & 2 & 3 \\ 0 & 4-t & 5 \\ 0 & 0 & 6-t \end{pmatrix} \] \[ = (1-t)\det\begin{pmatrix} 4-t & 5 \\ 0 & 6-t \end{pmatrix} - 0 + 0 \] \[ = (1-t)\left[(4-t)(6-t) - 0\right] \] \[ = (1-t)(4-t)(6-t) \]

which is a cubic polynomial with roots \(t = 1\), \(t = 4\), \(t = 6\).

How to find the eigenvectors of

\[ A = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 0 & 0 & 6 \end{pmatrix}, \]

and verify that \(1, 4, 6\) are the eigenvalues of \(A\)?

Observe that the eigenvectors must satisfy the system of linear equations

\[ Av = \lambda v \quad \text{or} \quad (A - \lambda I)v = 0 \]
\[ \begin{aligned} (1 - \lambda)v_1 + 2v_2 + 3v_3 &= 0 \\ (4 - \lambda)v_2 + 5v_3 &= 0 \\ (6 - \lambda)v_3 &= 0 \end{aligned} \]

For \(\lambda = 6\),

\[ \left. \begin{aligned} -5v_1 + 2v_2 + 3v_3 &= 0 \\ -2v_2 + 5v_3 &= 0 \\ 0 &= 0 \end{aligned} \right\} \text{WLOG take } v_3 = 1 \]
\[ v_2 = \tfrac{5}{2} \] \[ v_1 = \tfrac{2}{5}v_2 + \tfrac{3}{5} = \tfrac{8}{5} \]

To verify,

\[ A \cdot v = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 0 & 0 & 6 \end{pmatrix} \begin{pmatrix} 8/5 \\ 5/2 \\ 1 \end{pmatrix} = \begin{pmatrix} 48/5 \\ 15 \\ 6 \end{pmatrix} = 6 \cdot \begin{pmatrix} 8/5 \\ 5/2 \\ 1 \end{pmatrix} = \lambda v \]

In general, for a system of linear equations in row echelon form,

\[ \begin{pmatrix} M_{11} & M_{12} & \cdots & M_{1p} \\ 0 & M_{22} & \cdots & M_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & M_{pp} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{pmatrix} \]

Suppose \(M_{ij} \neq 0\) \(\forall\, j \geq i\). Then

\[ x_p = \frac{b_p}{M_{pp}} \] \[ \vdots \] \[ x_2 = \left(b_2 - \sum_{j=3}^{p} M_{2j} x_j\right) \frac{1}{M_{22}} \] \[ x_1 = \left(b_1 - \sum_{j=2}^{p} M_{1j} x_j\right) \frac{1}{M_{11}} \]

In general, this is called back substitution

\[ x_i = \left(b_i - \sum_{j=i+1}^{p} M_{ij} x_j\right) \frac{1}{M_{ii}} \qquad \text{for } i \in \{1, \ldots, p\} \]

Eigenvalues have an important role in matrix decompositions for certain classes of matrices. In particular, consider the following fundamental result for the class of symmetric matrices

Spectral Theorem. For any symmetric matrix \(A \in \mathbb{R}^{p \times p}\) there exists an orthogonal matrix \(Q\) (i.e., \(Q'Q = QQ' = I_p\)) such that,

\[ A = QDQ', \]

where \(D\) is a diagonal matrix composed of the eigenvalues of \(A\). \(\square\)

Note also that the columns of \(Q\) form an orthonormal basis for \(\mathbb{R}^p\) consisting of eigenvectors of \(A\).

The basis part means that every vector \(v \in \mathbb{R}^p\) can be expressed as

\[ v = \sum_{i=1}^{p} Q_i a_i, \]

for some \(a_1, \ldots, a_p \in \mathbb{R}\), and that \(Q_1, \ldots, Q_p\) are linearly independent.

Recall the definition of linear independence:

Vectors \(u_1, \ldots, u_n \in \mathbb{R}^p\) are said to be linearly independent if

\[ \sum_{i=1}^{n} u_i a_i = 0 \quad \text{implies that} \quad a_1 = \cdots = a_n = 0 \]

A set of vectors are said to be linearly dependent if they are not linearly independent.

Orthonormal means \(\forall\, i, j \in \{1, \ldots, p\}\)

\[ Q_i' Q_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}. \]

Next, verify that the columns \(Q_1, \ldots, Q_p\) are eigenvectors of \(A\) corresponding to eigenvalues given by the diagonal components of \(D\). Take any \(i \in \{1, \ldots, p\}\):

\[ AQ_i = QDQ'Q_i = QD \begin{pmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{pmatrix} = Q \begin{pmatrix} 0 \\ \vdots \\ D_{ii} \\ \vdots \\ 0 \end{pmatrix} = D_{ii} \cdot Q_i \]

Thus, \(Q_i\) is an eigenvector of \(A\), by definition, and by the same definition \(D_{ii}\) is an eigenvalue.

Further properties: \(\displaystyle\text{tr}(A) = \sum_{i=1}^{p} \lambda_i\) and \(\displaystyle\det(A) = \prod_{i=1}^{p} \lambda_i\).

\[ \text{tr}(A) = \text{tr}(QDQ') = \text{tr}(Q'QD) = \text{tr}(D) = \sum_{i=1}^{p} D_{ii} \]
\[ \det(A) = \det(QDQ') = \det(Q) \cdot \det(D) \cdot \det(Q') = \det(QQ') \cdot \det(D) = 1 \cdot \prod_{i=1}^{p} D_{ii} \]

What if a matrix is not symmetric nor even square? How to decompose?

Let \(X \in \mathbb{R}^{n \times p}\). Then \(X'X \in \mathbb{R}^{p \times p}\) and \(XX' \in \mathbb{R}^{n \times n}\), and both are symmetric:

\[ (X'X)' = X'(X')' = X'X \] \[ (XX')' = (X')'\, X' = XX' \]

Accordingly, we can apply the Spectral theorem to both \(X'X\) and \(XX'\) to decompose as

\[ X'X = V\Delta_1 V' \qquad \text{and} \qquad XX' = U\Delta_2 U' \]

for matrices \(V \in \mathbb{R}^{p \times p}\) and \(U \in \mathbb{R}^{n \times n}\), both orthogonal matrices, and diagonal matrices \(\Delta_1\) and \(\Delta_2\).

Moreover, it can be shown that \(X = U\Sigma V'\), where \(\Sigma\) is a diagonal matrix with nonzero diagonal components equal to the square roots of the nonzero diagonal components of both \(\Delta_1\) and \(\Delta_2\). The diagonal components of \(\Sigma\) are called the singular values of \(X\). For symmetric matrices \(A \in \mathbb{R}^{p \times p}\) with singular values \(\sigma_1, \ldots, \sigma_p\) and eigenvalues \(\lambda_1, \ldots, \lambda_p\),

\[ \sigma_i = |\lambda_i| \quad \text{for } i \in \{1, \ldots, p\} \]

SVD and other decompositions are important for data compression and dimension reduction. Consider this for \(X \in \mathbb{R}^{n \times p}\) with \(\text{rank}(X) = r\),

\[ \begin{aligned} X &= U\Sigma V' = (U_1, \ldots, U_n) \begin{pmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & 0 \\ & & & & \ddots \\ & & & & & 0 \end{pmatrix} \begin{pmatrix} V_1' \\ \vdots \\ V_p' \end{pmatrix} \\[1.5em] &= (U_1, \ldots, U_r) \begin{pmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{pmatrix} \begin{pmatrix} V_1' \\ \vdots \\ V_r' \end{pmatrix} \quad \text{“compact SVD”} \end{aligned} \]

where \(U_1, \ldots, U_n \in \mathbb{R}^n\) and \(V_1, \ldots, V_r \in \mathbb{R}^p\). Storing raw \(X\) requires \(n \cdot p\) pieces of information. Storing the SVD requires

\[ n \cdot r + r + p \cdot r = (n + p + 1) \cdot r \]

pieces of information. For example, suppose \(n = 100\), \(p = 30\). Then

\[ (n + p + 1) \cdot r = 131 \cdot r < 3000 = n \cdot p \quad \text{if } r < 23. \]

Furthermore, if \(\sigma_k, \ldots, \sigma_r \approx 0\), for some \(k \leq r\), then

\[ X \approx (U_1, \ldots, U_k) \begin{pmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_k \end{pmatrix} \begin{pmatrix} V_1' \\ \vdots \\ V_k' \end{pmatrix} \]

which allows for even less memory requirements.

Least Squares Problems

Suppose that \(X \in \mathbb{R}^{n \times p}\) and \(y \in \mathbb{R}^n\). Consider the conditions in which the system

\[ Xb = y \]

has a solution for some vector \(b \in \mathbb{R}^p\). As we have seen, to determine whether \(y \in \text{col}(X)\),

In the special case that \(p = n\) and \(\text{rank}(X) = n\), \(b = X^{-1}y\)

How about the system when \(p < n\)?

If the system \(Xb = y\) does not have a solution then perhaps the next best thing would be to find the \(b \in \mathbb{R}^p\) which makes \(Xb\) as “close” as possible to \(y\).

\[ Q(b) := \|y - Xb\|^2. \]

The \(\underset{b}{\text{argmin}}\{Q(b)\}\) is called the least squares solution.

Desirable properties of \(Q(b)\):

  1. Convex
  2. Differentiable
  3. \(\underset{b}{\text{argmin}}\{Q(b)\} = \{b : \nabla_b Q(b) = 0\}\)

Recall that for a function \(f : \mathbb{R}^p \to \mathbb{R}\),

\[ \nabla_x f(x) := \begin{pmatrix} \dfrac{\partial f(x)}{\partial x_1} \\[6pt] \vdots \\[2pt] \dfrac{\partial f(x)}{\partial x_p} \end{pmatrix} \in \mathbb{R}^p. \]

Lemma. For \(a, b \in \mathbb{R}^p\) and \(A \in \mathbb{R}^{p \times p}\), the following properties hold.

  1. \(\nabla_b(a'b) = a\)
  2. \(\nabla_b(b'Ab) = (A + A')b\) \(\square\)

Proof. See any multivariate calculus textbook. \(\square\)

Observing the fact that

\[ Q(\beta) = (y - X\beta)'(y - X\beta) = y'y - y'X\beta - \beta'X'y + \beta'X'X\beta, \]

it follows by the lemma that

\[ \nabla_\beta Q(\beta) = -2X'y + (X'X + X'X)\beta = -2X'(y - X\beta). \]

Setting \(\nabla_\beta Q(\beta) = 0\) yields the “normal equations”

\[ X'X\beta = X'y. \]

Example. Recall a simple linear regression,

\[ y_i = \beta_0 + \beta_1 x_i + u_i \qquad \text{for } i \in \{1, \ldots, n\}. \]

In this case,

\[ y = X\beta + u, \qquad \text{where} \quad X := \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}. \]

Then

\[ X'X\beta = \begin{pmatrix} n & \displaystyle\sum_i x_i \\[6pt] \displaystyle\sum_i x_i & \displaystyle\sum_i x_i^2 \end{pmatrix} \beta \qquad \text{and} \qquad X'y = \begin{pmatrix} \displaystyle\sum_i y_i \\[6pt] \displaystyle\sum_i x_i y_i \end{pmatrix}. \]

Accordingly, the normal equations are

\[ n\beta_0 + n\bar{x}_n \beta_1 = n\bar{y}_n \] \[ n\bar{x}_n \beta_0 + \sum_i x_i^2\, \beta_1 = \sum_i x_i y_i, \]

and so the least squares solution for \(\beta_0\) and \(\beta_1\) are

\[ \begin{aligned} \hat{\beta}_0 &= \bar{y}_n - \bar{x}_n \hat{\beta}_1 \\[1em] \hat{\beta}_1 &= \frac{\displaystyle\sum_i y_i(x_i - \bar{x}_n)}{\displaystyle\sum_i (x_i - \bar{x}_n)^2}. \end{aligned} \]

\(\square\)

Moreover, it can be mathematically proven that there always exists a solution to the normal equations. In particular, when \(\text{rank}(X) = p\), the unique least squares solution has the form

\[ \hat{b} = (X'X)^{-1} X'y. \]

This solution turns out to be the coefficients corresponding to the orthogonal projection of \(y\) onto \(\text{col}(X)\):

\[ \hat{y} = X\hat{b} = \underbrace{X(X'X)^{-1}X'}_{=:\, \mathcal{P}_X} y \]

A linear transformation \(\mathcal{P} : \mathbb{R}^n \to \mathbb{R}^n\) is said to be a projection onto some subspace \(V \subseteq \mathbb{R}^n\) if

  1. \(\mathcal{P}v \in V\) for every \(v \in \mathbb{R}^n\)
  2. \(\mathcal{P}v = v\) for every \(v \in V.\)

Further, \(\mathcal{P}\) is called an orthogonal projection onto \(V\) if it also satisfies

  1. \((\mathcal{P}v)'(u - \mathcal{P}u) = (v - \mathcal{P}v)'\mathcal{P}u = 0\) \(\forall\, v, u \in \mathbb{R}^n\).

Observe that for a projection matrix \(\mathcal{P}\), of any \(v \in \mathbb{R}^n\)

\[ \mathcal{P}^2 v = \mathcal{P}\mathcal{P}v = \mathcal{P}v \qquad \text{since } \mathcal{P}v \in V; \]
0 v 𝒫v
\((\mathcal{P}v)'(v - \mathcal{P}v) = 0\)

\(\mathcal{P}\) is “idempotent”. In fact, any idempotent matrix is a projection matrix, and any symmetric idempotent matrix is an orthogonal projection matrix.

Verify that \(\mathcal{P}_X\) is an orthogonal projection matrix onto \(\text{col}(X)\).

  1. Suppose \(v \in \mathbb{R}^n\). Then

    \[ \mathcal{P}_X v = X(X'X)^{-1}X'v \in \text{col}(X) \]
  2. Suppose \(v \in \text{col}(X)\). Then \(v = Xa\) for some \(a \in \mathbb{R}^p\), and so

    \[ \mathcal{P}_X v = X(X'X)^{-1}X'v = X(X'X)^{-1}X'Xa = Xa = v \]
  3. Let \(v, u \in \mathbb{R}^n\). Then

    \[ (\mathcal{P}_X v)'(u - \mathcal{P}_X u) = v'(\mathcal{P}_X u - \mathcal{P}_X \mathcal{P}_X u) = 0 \] \[ (v - \mathcal{P}_X v)'\mathcal{P}_X u = (\mathcal{P}_X' v - \mathcal{P}_X' \mathcal{P}_X v)'\, u = 0 \]

    since \(\mathcal{P}_X' = \bigl(X(X'X)^{-1}X'\bigr)' = X(X'X)^{-1}X' = \mathcal{P}_X\)

Example. Let \(X = \mathbf{1}_n\). Then \(\text{col}(X)\) is the subspace of vectors that have all components equal to the same value. That being true, projecting onto \(\text{col}(X)\) is a projection of a vector to a single value.

\[ \mathcal{P}_X = X(X'X)^{-1}X' = \tfrac{1}{n}\, XX' = \tfrac{1}{n} \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix} \]

Then for any \(y \in \mathbb{R}^n\),

\[ \mathcal{P}_X y = \tfrac{1}{n} \begin{pmatrix} \displaystyle\sum_i y_i \\ \vdots \\ \displaystyle\sum_i y_i \end{pmatrix} = \bar{y}_n \cdot \mathbf{1}_n \]

\(\square\)

What happens if \(\text{rank}(X) < p\) since in this case \((X'X)^{-1}\) does not exist?

Consider the orthogonal diagonalization \(XX' = UDU' \in \mathbb{R}^{n \times n}\) and use the fact that \(\text{col}(X) = \text{col}(\widetilde{U})\), where if \(k := \text{rank}(X)\) then

\[ \widetilde{U} := (U_1, \ldots, U_k) \]

and \(U_i\) for \(i \in \{1, \ldots, k\}\) corresponds to the \(i\)th nonzero eigenvalue of \(XX'\). Then

\[ \mathcal{P}_X = \widetilde{U}(\widetilde{U}'\widetilde{U})^{-1}\widetilde{U}' = \widetilde{U}\widetilde{U}'. \]

Gram-Schmidt Orthonormalization

Here we study a procedure for how to construct a set of mutually orthogonal vectors from a set of linearly independent vectors. Suppose that \(x_1, \ldots, x_p \in \mathbb{R}^n\) are linearly independent. The result of the Gram-Schmidt orthonormalization procedure is a set of orthonormal vectors \(u_1, \ldots, u_p \in \mathbb{R}^n\) such that

\[ \text{span}\{u_1, \ldots, u_p\} = \text{span}\{x_1, \ldots, x_p\}. \]

In matrix form with \(U := (u_1, \ldots, u_p)\) and \(X := (x_1, \ldots, x_p)\), that is

\[ \text{col}(U) = \text{col}(X). \]

How to construct \(u_1\)?

Next, for \(u_2\), modify \(x_2\) so that it is orthogonal to \(u_1\). Such a vector is in the orthogonal complement of the span of \(u_1\),

Verify that

\[ u_2' u_1 = x_2'(I_n - \mathcal{P}_{u_1})\, u_1 = 0. \]

Continuing on, the third vector must be orthogonal to both \(u_1\) and \(u_2\). That is,

so that

\[ \begin{aligned} u_3' u_2 &= x_3'(I_n - \mathcal{P}_{u_1} - \mathcal{P}_{u_2})\, u_2 \\ &= -x_3' \mathcal{P}_{u_1} u_2 \\ &= -x_3' u_1 (u_1' u_1)^{-1} u_1' u_2 \\ &= 0, \end{aligned} \]

and

\[ \begin{aligned} u_3' u_1 &= x_3'(I_n - \mathcal{P}_{u_1} - \mathcal{P}_{u_2})\, u_1 \\ &= -x_3' \mathcal{P}_{u_2} u_1 \\ &= -x_3' u_2 (u_2' u_2)^{-1} u_2' u_1 \\ &= 0. \end{aligned} \]

Accordingly,

\[ \begin{aligned} u_1 &= x_1 \\[.5em] u_2 &= x_2 - \frac{u_1 u_1' x_2}{\|u_1\|^2} \\[.5em] u_3 &= x_3 - \frac{u_1 u_1' x_3}{\|u_1\|^2} - \frac{u_2 u_2' x_3}{\|u_2\|^2} \\ &\vdots \end{aligned} \]

More concisely, for \(j \in \{1, \ldots, n\}\),

\[ u_j = x_j - \left(\sum_{k=1}^{j-1} \frac{u_k u_k'}{\|u_k\|^2}\right) x_j. \]

So now we have an orthogonal set \(\{u_1, \ldots, u_p\}\). How to show that,

\[ \text{span}\{u_1, \ldots, u_p\} = \text{span}\{x_1, \ldots, x_p\}\text{ ?} \]

Since this is equivalent to \(\text{col}(U) = \text{col}(X)\), express

\[ X = (u_1, \ldots, u_p) \cdot \underbrace{ \begin{pmatrix} 1 & \dfrac{u_1' x_2}{\|u_1\|^2} & \dfrac{u_1' x_3}{\|u_1\|^2} & \dfrac{u_1' x_4}{\|u_1\|^2} & \cdots & \dfrac{u_1' x_p}{\|u_1\|^2} \\[10pt] 0 & 1 & \dfrac{u_2' x_3}{\|u_2\|^2} & \dfrac{u_2' x_4}{\|u_2\|^2} & \cdots & \dfrac{u_2' x_p}{\|u_2\|^2} \\[10pt] 0 & 0 & 1 & \dfrac{u_3' x_4}{\|u_3\|^2} & \cdots & \dfrac{u_3' x_p}{\|u_3\|^2} \\[10pt] 0 & 0 & 0 & 1 & \cdots & \dfrac{u_4' x_p}{\|u_4\|^2} \\[6pt] \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\[4pt] 0 & 0 & 0 & 0 & \cdots & 1 \end{pmatrix} }_{=:\, S} \]
\[ = \underbrace{\left(\frac{u_1}{\|u_1\|}, \ldots, \frac{u_p}{\|u_p\|}\right)}_{=:\, Q} \cdot \underbrace{\begin{pmatrix} \|u_1\| & & \\ & \ddots & \\ & & \|u_p\| \end{pmatrix} \cdot S}_{=:\, R} \]

Note that \(X = QR\) is commonly called the “QR decomposition” for an orthogonal matrix \(Q\), and an upper triangular matrix \(R\). This expression also yields the Cholesky decomposition of \(X'X\),

\[ X'X = R'Q'QR = R'R, \]

where \(R\) is upper triangular. Think about the uniqueness of these decompositions.

The General Linear Model

Suppose \(x_1, \ldots, x_p \in \mathbb{R}\) are covariates/features with a meaningful interpretation in some setting such that a random variable \(Y\) is generated as

\[ Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + U \]

for some coefficients \(\beta_0, \beta_1, \ldots, \beta_p \in \mathbb{R}\) and some random variable \(U\) with \(E(U) = 0\). This is regarded as the general linear model, and repeated sampling from it leads to a sample of size \(n\), where

\[ \begin{aligned} Y_1 &= \beta_0 + \beta_1 x_{1,1} + \cdots + \beta_p x_{p,1} + U_1 \\[.3em] &\quad\vdots \\[.3em] Y_n &= \beta_0 + \beta_1 x_{1,n} + \cdots + \beta_p x_{p,n} + U_n \end{aligned} \]

Equivalently,

\[ Y = X\beta + U \]

where \(Y \in \mathbb{R}^n\), \(X \in \mathbb{R}^{n \times p}\), \(\beta \in \mathbb{R}^p\), and \(U \in \mathbb{R}^n\). Note that only the vector \(Y\) is commonly understood as the data. The matrix \(X\) is referred to as the design matrix, and \(\beta\) is regarded as fixed but unknown.

The statistical inference problem is to estimate the unknown parameters \(\beta\). A natural candidate for an estimate of \(\beta\) is the more general idea of the least squares solution

\[ \underset{\beta}{\operatorname{argmin}}\{Q(\beta)\} = \{\beta : X'X\beta = X'y\} \]

If we propose this estimator, then we need to study its sampling distribution. For example, if \(U \sim N_n(0, \sigma^2 I_n)\), then

\[ \hat{\beta} \sim N_p(\beta,\, \sigma^2(X'X)^{-1}) \]

if \(\text{rank}(X) = p\).