Linear algebra is the study of vector spaces, linear transformations, matrices, and inner product spaces.
Definition. A vector space \(V\) over a field \(F\) is a set of elements \(v \in V\), with the operations of addition and scalar multiplication, that satisfy the following axioms.
- \(\forall\, x, y \in V,\quad x + y = y + x\)
- \(\forall\, x, y, z \in V,\quad (x + y) + z = x + (y + z)\)
- There exists an element “\(0\)” \(\in V\) s.t. \(x + 0 = x\), \(\forall\, x \in V\)
- \(\forall\, x \in V\; \exists\, y \in V\) s.t. \(x + y = 0\)
- There exists an element “\(1\)” \(\in F\) s.t. \(1 \cdot x = x\), \(\forall\, x \in V\)
- \(\forall\, a, b \in F,\; \forall\, x \in V,\quad (ab)x = a(bx)\)
- \(\forall\, a \in F,\; \forall\, x, y \in V,\quad a(x + y) = ax + ay\)
- \(\forall\, a, b \in F,\; \forall\, x \in V,\quad (a + b)x = ax + bx\) \(\square\)
In statistics contexts, typically \(V = \mathbb{R}^n\) and \(F = \mathbb{R}\).
The notion of a vector space is truly foundational for data science. A vector space gives us an entire framework and intuition for thinking about data objects (even as data are stored as objects in a computer).
Definition. A collection of vectors \(x_1, \ldots, x_p\) are said to be linearly dependent if and only if there exist scalars \(c_1, \ldots, c_p\), not all zero, such that
A set of vectors that are *not* linearly dependent are said to be linearly independent. \(\square\)
Often, to show that a set of vectors are linearly independent, a good strategy is to demonstrate that \(c_1 x_1 + \cdots + c_p x_p = 0\) implies that \(c_1 = \cdots = c_p = 0\).
Note that if \(x_1, \ldots, x_p \in \mathbb{R}^n\), then there can be at most \(\min\{n, p\}\) linearly independent of the \(p\) vectors.
Definition. A basis, \(B\), for a vector space, \(V\), is a linearly independent subset of \(V\) with \(\text{span}(B) = V.\) \(\square\)
Example. For \(V = \mathbb{R}^n\), the set \(\{e_1, \ldots, e_n\}\), where \(e_i \in \mathbb{R}^n\) has \(i\)th component with value 1 and 0 for all other components, is a basis. This is commonly called the standard basis, and \(e_i\) a standard basis vector. \(\square\)
Theorem. Let \(V\) be a vector space and \(B := \{u_1, \ldots, u_n\} \subseteq V\). Then \(B\) is a basis for \(V\) if and only if every \(v \in V\) can be expressed uniquely as
for some scalars \(a_1, \ldots, a_n\). \(\square\)
Proof. Left as an exercise. \(\square\)
Theorem. If a vector space \(V\) is generated by some finite set \(S\), then some subset of \(S\) is a basis for \(V\). \(\square\)
Proof. Left as an exercise. \(\square\)
Definition. Let \(V\) and \(W\) be vector spaces over field \(\mathbb{R}\). We call a function \(T : V \to W\) a linear transformation from \(V\) to \(W\) if \(\forall\, x, y \in V\) and \(\forall\, c \in \mathbb{R}\),
- \(T(x + y) = T(x) + T(y)\)
- \(T(cx) = cT(x).\) \(\square\)
Example. Let \(A \in \mathbb{R}^{p \times q}\) and define \(T(x) := Ax\) \(\forall\, x \in \mathbb{R}^q\). Then \(T : \mathbb{R}^q \to \mathbb{R}^p\) is a linear transformation since
- \(\forall\, x, y \in \mathbb{R}^q,\quad T(x + y) = A(x + y) = Ax + Ay = T(x) + T(y)\), and
- \(\forall\, x \in \mathbb{R}^q,\; \forall\, c \in \mathbb{R},\quad T(cx) = A \cdot (cx) = cAx = cT(x).\) \(\square\)
Example. Let \(V = C(\mathbb{R})\), the vector space of continuous real-valued functions defined on \(\mathbb{R}\). Then \(\forall\, a, b \in \mathbb{R}\) with \(a < b\), the transformation \(T : V \to \mathbb{R}\) defined by,
is linear. \(\square\)
Definition. Let \(V\) and \(W\) be vector spaces, and \(T : V \to W\) a linear transformation. The set,
is called the null space of \(T\). The range of \(T\) is defined as,
Further, \(\text{rank}(T) := \dim(\text{range}(T))\). \(\square\)
Theorem. (dimension theorem) Let \(V\) and \(W\) be vector spaces, and \(T : V \to W\) be linear. If \(V\) is finite-dimensional, then
Proof. See any linear algebra textbook. \(\square\)
Recall that if \(T(x) := Ax\), then \(\text{range}(T)\) is called the column space of \(A\), denoted \(\text{col}(A)\). Similarly for \(\text{range}(T')\), called the row space of \(A\), and denoted by \(\text{row}(A)\).
Definition. Let \(A \in \mathbb{R}^{p \times q}\) and \(B \in \mathbb{R}^{q \times m}\). Then the matrix product, denoted by \(AB\), has components defined as
for \(i \in \{1, \ldots, p\}\) and \(j \in \{1, \ldots, m\}\). \(\square\)
Further notions for matrices.
- The transpose of \(A\), denoted \(A'\), satisfies the property that \((A')_{ij} = A_{ji}\) for all indices \(i, j\).
- \(A \in \mathbb{R}^{p \times q}\) is called square if \(p = q\).
- A square matrix is called symmetric if \(A' = A\).
- \(A \in \mathbb{R}^{p \times p}\) is said to be invertible if \(\exists\, B \in \mathbb{R}^{p \times p}\) s.t. \(AB = I_p = BA\). In that case, \(A^{-1} := B\).
- \(A \in \mathbb{R}^{p \times p}\), the trace of \(A\) is defined as \(\displaystyle\text{tr}(A) := \sum_{i=1}^{p} A_{ii}\).
Selected properties of matrices.
- \((AB)' = B'A'\)
- \(A \in \mathbb{R}^{p \times p}\) is invertible if and only if \(\text{rank}(A) = p\) (equivalently \(\det(A) \neq 0\)).
- If both \(A, B \in \mathbb{R}^{p \times p}\) are invertible, then \((AB)^{-1} = B^{-1}A^{-1}\).
- For \(A\) and \(B\) of appropriate dimensions, \(\text{tr}(AB) = \text{tr}(BA)\).
Definition. Let \(A \in \mathbb{R}^{p \times p}\). A nonzero vector \(v \in \mathbb{R}^p\) is called an eigenvector of \(A\) if \(Av = \lambda v\) for some scalar \(\lambda\), called an eigenvalue. \(\square\)
Theorem. A scalar \(\lambda\) is an eigenvalue of a matrix \(A \in \mathbb{R}^{p \times p}\) if and only if \(\det(A - \lambda I_p) = 0\). The polynomial \(f(t) := \det(A - tI_p)\) is called the characteristic polynomial of \(A\). \(\square\)
Proof. See any linear algebra textbook. \(\square\)
For example, if
then \(\forall\, v \in \mathbb{R}^3\), \(Av = 1 \cdot v\). How about
which is a cubic polynomial with roots \(t = 1\), \(t = 4\), \(t = 6\).
How to find the eigenvectors of
and verify that \(1, 4, 6\) are the eigenvalues of \(A\)?
Observe that the eigenvectors must satisfy the system of linear equations
For \(\lambda = 6\),
To verify,
In general, for a system of linear equations in row echelon form,
Suppose \(M_{ij} \neq 0\) \(\forall\, j \geq i\). Then
In general, this is called back substitution
Eigenvalues have an important role in matrix decompositions for certain classes of matrices. In particular, consider the following fundamental result for the class of symmetric matrices
Spectral Theorem. For any symmetric matrix \(A \in \mathbb{R}^{p \times p}\) there exists an orthogonal matrix \(Q\) (i.e., \(Q'Q = QQ' = I_p\)) such that,
where \(D\) is a diagonal matrix composed of the eigenvalues of \(A\). \(\square\)
Note also that the columns of \(Q\) form an orthonormal basis for \(\mathbb{R}^p\) consisting of eigenvectors of \(A\).
The basis part means that every vector \(v \in \mathbb{R}^p\) can be expressed as
for some \(a_1, \ldots, a_p \in \mathbb{R}\), and that \(Q_1, \ldots, Q_p\) are linearly independent.
Recall the definition of linear independence:
Vectors \(u_1, \ldots, u_n \in \mathbb{R}^p\) are said to be linearly independent if
A set of vectors are said to be linearly dependent if they are not linearly independent.
Orthonormal means \(\forall\, i, j \in \{1, \ldots, p\}\)
Next, verify that the columns \(Q_1, \ldots, Q_p\) are eigenvectors of \(A\) corresponding to eigenvalues given by the diagonal components of \(D\). Take any \(i \in \{1, \ldots, p\}\):
Thus, \(Q_i\) is an eigenvector of \(A\), by definition, and by the same definition \(D_{ii}\) is an eigenvalue.
Further properties: \(\displaystyle\text{tr}(A) = \sum_{i=1}^{p} \lambda_i\) and \(\displaystyle\det(A) = \prod_{i=1}^{p} \lambda_i\).
What if a matrix is not symmetric nor even square? How to decompose?
Let \(X \in \mathbb{R}^{n \times p}\). Then \(X'X \in \mathbb{R}^{p \times p}\) and \(XX' \in \mathbb{R}^{n \times n}\), and both are symmetric:
Accordingly, we can apply the Spectral theorem to both \(X'X\) and \(XX'\) to decompose as
for matrices \(V \in \mathbb{R}^{p \times p}\) and \(U \in \mathbb{R}^{n \times n}\), both orthogonal matrices, and diagonal matrices \(\Delta_1\) and \(\Delta_2\).
Moreover, it can be shown that \(X = U\Sigma V'\), where \(\Sigma\) is a diagonal matrix with nonzero diagonal components equal to the square roots of the nonzero diagonal components of both \(\Delta_1\) and \(\Delta_2\). The diagonal components of \(\Sigma\) are called the singular values of \(X\). For symmetric matrices \(A \in \mathbb{R}^{p \times p}\) with singular values \(\sigma_1, \ldots, \sigma_p\) and eigenvalues \(\lambda_1, \ldots, \lambda_p\),
SVD and other decompositions are important for data compression and dimension reduction. Consider this for \(X \in \mathbb{R}^{n \times p}\) with \(\text{rank}(X) = r\),
where \(U_1, \ldots, U_n \in \mathbb{R}^n\) and \(V_1, \ldots, V_r \in \mathbb{R}^p\). Storing raw \(X\) requires \(n \cdot p\) pieces of information. Storing the SVD requires
pieces of information. For example, suppose \(n = 100\), \(p = 30\). Then
Furthermore, if \(\sigma_k, \ldots, \sigma_r \approx 0\), for some \(k \leq r\), then
which allows for even less memory requirements.
Least Squares Problems
Suppose that \(X \in \mathbb{R}^{n \times p}\) and \(y \in \mathbb{R}^n\). Consider the conditions in which the system
has a solution for some vector \(b \in \mathbb{R}^p\). As we have seen, to determine whether \(y \in \text{col}(X)\),
- Gaussian elimination to express system in row echelon form.
- Back substitution algorithm
In the special case that \(p = n\) and \(\text{rank}(X) = n\), \(b = X^{-1}y\)
How about the system when \(p < n\)?
- \(X\) is tall and narrow (i.e., more rows than columns)
- Likely, \(y \notin \text{col}(X)\)
- This is the context of least squares problems
If the system \(Xb = y\) does not have a solution then perhaps the next best thing would be to find the \(b \in \mathbb{R}^p\) which makes \(Xb\) as “close” as possible to \(y\).
The \(\underset{b}{\text{argmin}}\{Q(b)\}\) is called the least squares solution.
Desirable properties of \(Q(b)\):
- Convex
- Differentiable
- \(\underset{b}{\text{argmin}}\{Q(b)\} = \{b : \nabla_b Q(b) = 0\}\)
Recall that for a function \(f : \mathbb{R}^p \to \mathbb{R}\),
Lemma. For \(a, b \in \mathbb{R}^p\) and \(A \in \mathbb{R}^{p \times p}\), the following properties hold.
- \(\nabla_b(a'b) = a\)
- \(\nabla_b(b'Ab) = (A + A')b\) \(\square\)
Proof. See any multivariate calculus textbook. \(\square\)
Observing the fact that
it follows by the lemma that
Setting \(\nabla_\beta Q(\beta) = 0\) yields the “normal equations”
Example. Recall a simple linear regression,
In this case,
Then
Accordingly, the normal equations are
and so the least squares solution for \(\beta_0\) and \(\beta_1\) are
\(\square\)
Moreover, it can be mathematically proven that there always exists a solution to the normal equations. In particular, when \(\text{rank}(X) = p\), the unique least squares solution has the form
This solution turns out to be the coefficients corresponding to the orthogonal projection of \(y\) onto \(\text{col}(X)\):
A linear transformation \(\mathcal{P} : \mathbb{R}^n \to \mathbb{R}^n\) is said to be a projection onto some subspace \(V \subseteq \mathbb{R}^n\) if
- \(\mathcal{P}v \in V\) for every \(v \in \mathbb{R}^n\)
- \(\mathcal{P}v = v\) for every \(v \in V.\)
Further, \(\mathcal{P}\) is called an orthogonal projection onto \(V\) if it also satisfies
- \((\mathcal{P}v)'(u - \mathcal{P}u) = (v - \mathcal{P}v)'\mathcal{P}u = 0\) \(\forall\, v, u \in \mathbb{R}^n\).
Observe that for a projection matrix \(\mathcal{P}\), of any \(v \in \mathbb{R}^n\)
\(\mathcal{P}\) is “idempotent”. In fact, any idempotent matrix is a projection matrix, and any symmetric idempotent matrix is an orthogonal projection matrix.
Verify that \(\mathcal{P}_X\) is an orthogonal projection matrix onto \(\text{col}(X)\).
-
Suppose \(v \in \mathbb{R}^n\). Then
\[ \mathcal{P}_X v = X(X'X)^{-1}X'v \in \text{col}(X) \] -
Suppose \(v \in \text{col}(X)\). Then \(v = Xa\) for some \(a \in \mathbb{R}^p\), and so
\[ \mathcal{P}_X v = X(X'X)^{-1}X'v = X(X'X)^{-1}X'Xa = Xa = v \] -
Let \(v, u \in \mathbb{R}^n\). Then
\[ (\mathcal{P}_X v)'(u - \mathcal{P}_X u) = v'(\mathcal{P}_X u - \mathcal{P}_X \mathcal{P}_X u) = 0 \] \[ (v - \mathcal{P}_X v)'\mathcal{P}_X u = (\mathcal{P}_X' v - \mathcal{P}_X' \mathcal{P}_X v)'\, u = 0 \]since \(\mathcal{P}_X' = \bigl(X(X'X)^{-1}X'\bigr)' = X(X'X)^{-1}X' = \mathcal{P}_X\)
Example. Let \(X = \mathbf{1}_n\). Then \(\text{col}(X)\) is the subspace of vectors that have all components equal to the same value. That being true, projecting onto \(\text{col}(X)\) is a projection of a vector to a single value.
Then for any \(y \in \mathbb{R}^n\),
\(\square\)
What happens if \(\text{rank}(X) < p\) since in this case \((X'X)^{-1}\) does not exist?
Consider the orthogonal diagonalization \(XX' = UDU' \in \mathbb{R}^{n \times n}\) and use the fact that \(\text{col}(X) = \text{col}(\widetilde{U})\), where if \(k := \text{rank}(X)\) then
and \(U_i\) for \(i \in \{1, \ldots, k\}\) corresponds to the \(i\)th nonzero eigenvalue of \(XX'\). Then
Gram-Schmidt Orthonormalization
Here we study a procedure for how to construct a set of mutually orthogonal vectors from a set of linearly independent vectors. Suppose that \(x_1, \ldots, x_p \in \mathbb{R}^n\) are linearly independent. The result of the Gram-Schmidt orthonormalization procedure is a set of orthonormal vectors \(u_1, \ldots, u_p \in \mathbb{R}^n\) such that
In matrix form with \(U := (u_1, \ldots, u_p)\) and \(X := (x_1, \ldots, x_p)\), that is
How to construct \(u_1\)?
- Take \(u_1 := x_1\)
Next, for \(u_2\), modify \(x_2\) so that it is orthogonal to \(u_1\). Such a vector is in the orthogonal complement of the span of \(u_1\),
- Take \(u_2 := (I_n - \mathcal{P}_{u_1})\, x_2\)
Verify that
Continuing on, the third vector must be orthogonal to both \(u_1\) and \(u_2\). That is,
- Take \(u_3 := (I_n - \mathcal{P}_{u_1} - \mathcal{P}_{u_2})\, x_3\)
so that
and
Accordingly,
More concisely, for \(j \in \{1, \ldots, n\}\),
So now we have an orthogonal set \(\{u_1, \ldots, u_p\}\). How to show that,
Since this is equivalent to \(\text{col}(U) = \text{col}(X)\), express
Note that \(X = QR\) is commonly called the “QR decomposition” for an orthogonal matrix \(Q\), and an upper triangular matrix \(R\). This expression also yields the Cholesky decomposition of \(X'X\),
where \(R\) is upper triangular. Think about the uniqueness of these decompositions.
The General Linear Model
Suppose \(x_1, \ldots, x_p \in \mathbb{R}\) are covariates/features with a meaningful interpretation in some setting such that a random variable \(Y\) is generated as
for some coefficients \(\beta_0, \beta_1, \ldots, \beta_p \in \mathbb{R}\) and some random variable \(U\) with \(E(U) = 0\). This is regarded as the general linear model, and repeated sampling from it leads to a sample of size \(n\), where
Equivalently,
where \(Y \in \mathbb{R}^n\), \(X \in \mathbb{R}^{n \times p}\), \(\beta \in \mathbb{R}^p\), and \(U \in \mathbb{R}^n\). Note that only the vector \(Y\) is commonly understood as the data. The matrix \(X\) is referred to as the design matrix, and \(\beta\) is regarded as fixed but unknown.
The statistical inference problem is to estimate the unknown parameters \(\beta\). A natural candidate for an estimate of \(\beta\) is the more general idea of the least squares solution
If we propose this estimator, then we need to study its sampling distribution. For example, if \(U \sim N_n(0, \sigma^2 I_n)\), then
if \(\text{rank}(X) = p\).