Linear algebra is the study of vector spaces, linear transformations, matrices, and inner product spaces.
Definition. A vector space \(V\) over a field \(F\) is a set of elements \(v \in V\), with the operations of addition and scalar multiplication, that satisfy the following axioms.
- \(\forall\, x, y \in V,\quad x + y = y + x\)
- \(\forall\, x, y, z \in V,\quad (x + y) + z = x + (y + z)\)
- There exists an element “\(0\)” \(\in V\) s.t. \(x + 0 = x\), \(\forall\, x \in V\)
- \(\forall\, x \in V\; \exists\, y \in V\) s.t. \(x + y = 0\)
- There exists an element “\(1\)” \(\in F\) s.t. \(1 \cdot x = x\), \(\forall\, x \in V\)
- \(\forall\, a, b \in F,\; \forall\, x \in V,\quad (ab)x = a(bx)\)
- \(\forall\, a \in F,\; \forall\, x, y \in V,\quad a(x + y) = ax + ay\)
- \(\forall\, a, b \in F,\; \forall\, x \in V,\quad (a + b)x = ax + bx\) \(\square\)
In statistics contexts, typically \(V = \mathbb{R}^n\) and \(F = \mathbb{R}\).
The notion of a vector space is truly foundational for data science. A vector space gives us an entire framework and intuition for thinking about data objects (even as data are stored as objects in a computer).
Definition. A collection of vectors \(x_1, \ldots, x_p\) are said to be linearly dependent if and only if there exist scalars \(c_1, \ldots, c_p\), not all zero, such that
A set of vectors that are *not* linearly dependent are said to be linearly independent. \(\square\)
Often, to show that a set of vectors are linearly independent, a good strategy is to demonstrate that \(c_1 x_1 + \cdots + c_p x_p = 0\) implies that \(c_1 = \cdots = c_p = 0\).
Note that if \(x_1, \ldots, x_p \in \mathbb{R}^n\), then there can be at most \(\min\{n, p\}\) linearly independent of the \(p\) vectors.
Definition. A basis, \(B\), for a vector space, \(V\), is a linearly independent subset of \(V\) with \(\text{span}(B) = V.\) \(\square\)
Example. For \(V = \mathbb{R}^n\), the set \(\{e_1, \ldots, e_n\}\), where \(e_i \in \mathbb{R}^n\) has \(i\)th component with value 1 and 0 for all other components, is a basis. This is commonly called the standard basis, and \(e_i\) a standard basis vector. \(\square\)
Theorem. Let \(V\) be a vector space and \(B := \{u_1, \ldots, u_n\} \subseteq V\). Then \(B\) is a basis for \(V\) if and only if every \(v \in V\) can be expressed uniquely as
for some scalars \(a_1, \ldots, a_n\). \(\square\)
Proof. Left as an exercise. \(\square\)
Theorem. If a vector space \(V\) is generated by some finite set \(S\), then some subset of \(S\) is a basis for \(V\). \(\square\)
Proof. Left as an exercise. \(\square\)
Definition. Let \(V\) and \(W\) be vector spaces over field \(\mathbb{R}\). We call a function \(T : V \to W\) a linear transformation from \(V\) to \(W\) if \(\forall\, x, y \in V\) and \(\forall\, c \in \mathbb{R}\),
- \(T(x + y) = T(x) + T(y)\)
- \(T(cx) = cT(x).\) \(\square\)
Example. Let \(A \in \mathbb{R}^{p \times q}\) and define \(T(x) := Ax\) \(\forall\, x \in \mathbb{R}^q\). Then \(T : \mathbb{R}^q \to \mathbb{R}^p\) is a linear transformation since
- \(\forall\, x, y \in \mathbb{R}^q,\quad T(x + y) = A(x + y) = Ax + Ay = T(x) + T(y)\), and
- \(\forall\, x \in \mathbb{R}^q,\; \forall\, c \in \mathbb{R},\quad T(cx) = A \cdot (cx) = cAx = cT(x).\) \(\square\)
Example. Let \(V = C(\mathbb{R})\), the vector space of continuous real-valued functions defined on \(\mathbb{R}\). Then \(\forall\, a, b \in \mathbb{R}\) with \(a < b\), the transformation \(T : V \to \mathbb{R}\) defined by,
is linear. \(\square\)
Definition. Let \(V\) and \(W\) be vector spaces, and \(T : V \to W\) a linear transformation. The set,
is called the null space of \(T\). The range of \(T\) is defined as,
Further, \(\text{rank}(T) := \dim(\text{range}(T))\). \(\square\)
Theorem. (dimension theorem) Let \(V\) and \(W\) be vector spaces, and \(T : V \to W\) be linear. If \(V\) is finite-dimensional, then
Proof. See any linear algebra textbook. \(\square\)
Recall that if \(T(x) := Ax\), then \(\text{range}(T)\) is called the column space of \(A\), denoted \(\text{col}(A)\). Similarly for \(\text{range}(T')\), called the row space of \(A\), and denoted by \(\text{row}(A)\).
Definition. Let \(A \in \mathbb{R}^{p \times q}\) and \(B \in \mathbb{R}^{q \times m}\). Then the matrix product, denoted by \(AB\), has components defined as
for \(i \in \{1, \ldots, p\}\) and \(j \in \{1, \ldots, m\}\). \(\square\)
Further notions for matrices.
- The transpose of \(A\), denoted \(A'\), satisfies the property that \((A')_{ij} = A_{ji}\) for all indices \(i, j\).
- \(A \in \mathbb{R}^{p \times q}\) is called square if \(p = q\).
- A square matrix is called symmetric if \(A' = A\).
- \(A \in \mathbb{R}^{p \times p}\) is said to be invertible if \(\exists\, B \in \mathbb{R}^{p \times p}\) s.t. \(AB = I_p = BA\). In that case, \(A^{-1} := B\).
- \(A \in \mathbb{R}^{p \times p}\), the trace of \(A\) is defined as \(\displaystyle\text{tr}(A) := \sum_{i=1}^{p} A_{ii}\).
Selected properties of matrices.
- \((AB)' = B'A'\)
- \(A \in \mathbb{R}^{p \times p}\) is invertible if and only if \(\text{rank}(A) = p\) (equivalently \(\det(A) \neq 0\)).
- If both \(A, B \in \mathbb{R}^{p \times p}\) are invertible, then \((AB)^{-1} = B^{-1}A^{-1}\).
- For \(A\) and \(B\) of appropriate dimensions, \(\text{tr}(AB) = \text{tr}(BA)\).
Definition. Let \(A \in \mathbb{R}^{p \times p}\). A nonzero vector \(v \in \mathbb{R}^p\) is called an eigenvector of \(A\) if \(Av = \lambda v\) for some scalar \(\lambda\), called an eigenvalue. \(\square\)
Theorem. A scalar \(\lambda\) is an eigenvalue of a matrix \(A \in \mathbb{R}^{p \times p}\) if and only if \(\det(A - \lambda I_p) = 0\). The polynomial \(f(t) := \det(A - tI_p)\) is called the characteristic polynomial of \(A\). \(\square\)
Proof. See any linear algebra textbook. \(\square\)
For example, if
then \(\forall\, v \in \mathbb{R}^3\), \(Av = 1 \cdot v\). How about
which is a cubic polynomial with roots \(t = 1\), \(t = 4\), \(t = 6\).
How to find the eigenvectors of
and verify that \(1, 4, 6\) are the eigenvalues of \(A\)?
Observe that the eigenvectors must satisfy the system of linear equations
For \(\lambda = 6\),
To verify,
In general, for a system of linear equations in row echelon form,
Suppose \(M_{ij} \neq 0\) \(\forall\, j \geq i\). Then
In general, this is called back substitution
Eigenvalues have an important role in matrix decompositions for certain classes of matrices. In particular, consider the following fundamental result for the class of symmetric matrices
Spectral Theorem. For any symmetric matrix \(A \in \mathbb{R}^{p \times p}\) there exists an orthogonal matrix \(Q\) (i.e., \(Q'Q = QQ' = I_p\)) such that,
where \(D\) is a diagonal matrix composed of the eigenvalues of \(A\). \(\square\)
Note also that the columns of \(Q\) form an orthonormal basis for \(\mathbb{R}^p\) consisting of eigenvectors of \(A\).
The basis part means that every vector \(v \in \mathbb{R}^p\) can be expressed as
for some \(a_1, \ldots, a_p \in \mathbb{R}\), and that \(Q_1, \ldots, Q_p\) are linearly independent.
Recall the definition of linear independence:
Vectors \(u_1, \ldots, u_n \in \mathbb{R}^p\) are said to be linearly independent if
A set of vectors are said to be linearly dependent if they are not linearly independent.
Orthonormal means \(\forall\, i, j \in \{1, \ldots, p\}\)
Next, verify that the columns \(Q_1, \ldots, Q_p\) are eigenvectors of \(A\) corresponding to eigenvalues given by the diagonal components of \(D\). Take any \(i \in \{1, \ldots, p\}\):
Thus, \(Q_i\) is an eigenvector of \(A\), by definition, and by the same definition \(D_{ii}\) is an eigenvalue.
Further properties: \(\displaystyle\text{tr}(A) = \sum_{i=1}^{p} \lambda_i\) and \(\displaystyle\det(A) = \prod_{i=1}^{p} \lambda_i\).
What if a matrix is not symmetric nor even square? How to decompose?
Let \(X \in \mathbb{R}^{n \times p}\). Then \(X'X \in \mathbb{R}^{p \times p}\) and \(XX' \in \mathbb{R}^{n \times n}\), and both are symmetric:
Accordingly, we can apply the Spectral theorem to both \(X'X\) and \(XX'\) to decompose as
for matrices \(V \in \mathbb{R}^{p \times p}\) and \(U \in \mathbb{R}^{n \times n}\), both orthogonal matrices, and diagonal matrices \(\Delta_1\) and \(\Delta_2\).
Moreover, it can be shown that \(X = U\Sigma V'\), where \(\Sigma\) is a diagonal matrix with nonzero diagonal components equal to the square roots of the nonzero diagonal components of both \(\Delta_1\) and \(\Delta_2\). The diagonal components of \(\Sigma\) are called the singular values of \(X\). For symmetric matrices \(A \in \mathbb{R}^{p \times p}\) with singular values \(\sigma_1, \ldots, \sigma_p\) and eigenvalues \(\lambda_1, \ldots, \lambda_p\),
SVD and other decompositions are important for data compression and dimension reduction. Consider this for \(X \in \mathbb{R}^{n \times p}\) with \(\text{rank}(X) = r\),
where \(U_1, \ldots, U_n \in \mathbb{R}^n\) and \(V_1, \ldots, V_r \in \mathbb{R}^p\). Storing raw \(X\) requires \(n \cdot p\) pieces of information. Storing the SVD requires
pieces of information. For example, suppose \(n = 100\), \(p = 30\). Then
Furthermore, if \(\sigma_k, \ldots, \sigma_r \approx 0\), for some \(k \leq r\), then
which allows for even less memory requirements.