Linear algebra is the study of vector spaces, linear transformations, matrices, and inner product spaces.

Definition. A vector space \(V\) over a field \(F\) is a set of elements \(v \in V\), with the operations of addition and scalar multiplication, that satisfy the following axioms.

  1. \(\forall\, x, y \in V,\quad x + y = y + x\)
  2. \(\forall\, x, y, z \in V,\quad (x + y) + z = x + (y + z)\)
  3. There exists an element “\(0\)” \(\in V\) s.t. \(x + 0 = x\), \(\forall\, x \in V\)
  4. \(\forall\, x \in V\; \exists\, y \in V\) s.t. \(x + y = 0\)
  5. There exists an element “\(1\)” \(\in F\) s.t. \(1 \cdot x = x\), \(\forall\, x \in V\)
  6. \(\forall\, a, b \in F,\; \forall\, x \in V,\quad (ab)x = a(bx)\)
  7. \(\forall\, a \in F,\; \forall\, x, y \in V,\quad a(x + y) = ax + ay\)
  8. \(\forall\, a, b \in F,\; \forall\, x \in V,\quad (a + b)x = ax + bx\) \(\square\)

In statistics contexts, typically \(V = \mathbb{R}^n\) and \(F = \mathbb{R}\).

The notion of a vector space is truly foundational for data science. A vector space gives us an entire framework and intuition for thinking about data objects (even as data are stored as objects in a computer).

Definition. A collection of vectors \(x_1, \ldots, x_p\) are said to be linearly dependent if and only if there exist scalars \(c_1, \ldots, c_p\), not all zero, such that

\[ \sum_{i=1}^{p} c_i x_i = 0. \]

A set of vectors that are *not* linearly dependent are said to be linearly independent. \(\square\)

Often, to show that a set of vectors are linearly independent, a good strategy is to demonstrate that \(c_1 x_1 + \cdots + c_p x_p = 0\) implies that \(c_1 = \cdots = c_p = 0\).

Note that if \(x_1, \ldots, x_p \in \mathbb{R}^n\), then there can be at most \(\min\{n, p\}\) linearly independent of the \(p\) vectors.

Definition. A basis, \(B\), for a vector space, \(V\), is a linearly independent subset of \(V\) with \(\text{span}(B) = V.\) \(\square\)

Example. For \(V = \mathbb{R}^n\), the set \(\{e_1, \ldots, e_n\}\), where \(e_i \in \mathbb{R}^n\) has \(i\)th component with value 1 and 0 for all other components, is a basis. This is commonly called the standard basis, and \(e_i\) a standard basis vector. \(\square\)

Theorem. Let \(V\) be a vector space and \(B := \{u_1, \ldots, u_n\} \subseteq V\). Then \(B\) is a basis for \(V\) if and only if every \(v \in V\) can be expressed uniquely as

\[ v = a_1 u_1 + \cdots + a_n u_n \]

for some scalars \(a_1, \ldots, a_n\). \(\square\)

Proof. Left as an exercise. \(\square\)

Theorem. If a vector space \(V\) is generated by some finite set \(S\), then some subset of \(S\) is a basis for \(V\). \(\square\)

Proof. Left as an exercise. \(\square\)

Definition. Let \(V\) and \(W\) be vector spaces over field \(\mathbb{R}\). We call a function \(T : V \to W\) a linear transformation from \(V\) to \(W\) if \(\forall\, x, y \in V\) and \(\forall\, c \in \mathbb{R}\),

  1. \(T(x + y) = T(x) + T(y)\)
  2. \(T(cx) = cT(x).\) \(\square\)

Example. Let \(A \in \mathbb{R}^{p \times q}\) and define \(T(x) := Ax\) \(\forall\, x \in \mathbb{R}^q\). Then \(T : \mathbb{R}^q \to \mathbb{R}^p\) is a linear transformation since

  1. \(\forall\, x, y \in \mathbb{R}^q,\quad T(x + y) = A(x + y) = Ax + Ay = T(x) + T(y)\), and
  2. \(\forall\, x \in \mathbb{R}^q,\; \forall\, c \in \mathbb{R},\quad T(cx) = A \cdot (cx) = cAx = cT(x).\) \(\square\)

Example. Let \(V = C(\mathbb{R})\), the vector space of continuous real-valued functions defined on \(\mathbb{R}\). Then \(\forall\, a, b \in \mathbb{R}\) with \(a < b\), the transformation \(T : V \to \mathbb{R}\) defined by,

\[ T(f) := \int_a^b f(x)\, dx \]

is linear. \(\square\)

Definition. Let \(V\) and \(W\) be vector spaces, and \(T : V \to W\) a linear transformation. The set,

\[ \text{null}(T) := \{x \in V : T(x) = 0\}, \]

is called the null space of \(T\). The range of \(T\) is defined as,

\[ \text{range}(T) := \{y \in W : T(x) = y \text{ for some } x \in V\}. \]

Further, \(\text{rank}(T) := \dim(\text{range}(T))\). \(\square\)

Theorem. (dimension theorem) Let \(V\) and \(W\) be vector spaces, and \(T : V \to W\) be linear. If \(V\) is finite-dimensional, then

\[ \dim(\text{null}(T)) + \text{rank}(T) = \dim(V). \]
\(\square\)

Proof. See any linear algebra textbook. \(\square\)

Recall that if \(T(x) := Ax\), then \(\text{range}(T)\) is called the column space of \(A\), denoted \(\text{col}(A)\). Similarly for \(\text{range}(T')\), called the row space of \(A\), and denoted by \(\text{row}(A)\).

Definition. Let \(A \in \mathbb{R}^{p \times q}\) and \(B \in \mathbb{R}^{q \times m}\). Then the matrix product, denoted by \(AB\), has components defined as

\[ (AB)_{ij} := \sum_{k=1}^{q} A_{ik} B_{kj} \]

for \(i \in \{1, \ldots, p\}\) and \(j \in \{1, \ldots, m\}\). \(\square\)

Further notions for matrices.

  1. The transpose of \(A\), denoted \(A'\), satisfies the property that \((A')_{ij} = A_{ji}\) for all indices \(i, j\).
  2. \(A \in \mathbb{R}^{p \times q}\) is called square if \(p = q\).
  3. A square matrix is called symmetric if \(A' = A\).
  4. \(A \in \mathbb{R}^{p \times p}\) is said to be invertible if \(\exists\, B \in \mathbb{R}^{p \times p}\) s.t. \(AB = I_p = BA\). In that case, \(A^{-1} := B\).
  5. \(A \in \mathbb{R}^{p \times p}\), the trace of \(A\) is defined as \(\displaystyle\text{tr}(A) := \sum_{i=1}^{p} A_{ii}\).

Selected properties of matrices.

  1. \((AB)' = B'A'\)
  2. \(A \in \mathbb{R}^{p \times p}\) is invertible if and only if \(\text{rank}(A) = p\) (equivalently \(\det(A) \neq 0\)).
  3. If both \(A, B \in \mathbb{R}^{p \times p}\) are invertible, then \((AB)^{-1} = B^{-1}A^{-1}\).
  4. For \(A\) and \(B\) of appropriate dimensions, \(\text{tr}(AB) = \text{tr}(BA)\).

Definition. Let \(A \in \mathbb{R}^{p \times p}\). A nonzero vector \(v \in \mathbb{R}^p\) is called an eigenvector of \(A\) if \(Av = \lambda v\) for some scalar \(\lambda\), called an eigenvalue. \(\square\)

Theorem. A scalar \(\lambda\) is an eigenvalue of a matrix \(A \in \mathbb{R}^{p \times p}\) if and only if \(\det(A - \lambda I_p) = 0\). The polynomial \(f(t) := \det(A - tI_p)\) is called the characteristic polynomial of \(A\). \(\square\)

Proof. See any linear algebra textbook. \(\square\)

For example, if

\[ A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \]

then \(\forall\, v \in \mathbb{R}^3\), \(Av = 1 \cdot v\). How about

\[ A = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 0 & 0 & 6 \end{pmatrix}? \]
\[ f(t) = \det(A - tI_3) \] \[ = \det \begin{pmatrix} 1-t & 2 & 3 \\ 0 & 4-t & 5 \\ 0 & 0 & 6-t \end{pmatrix} \] \[ = (1-t)\det\begin{pmatrix} 4-t & 5 \\ 0 & 6-t \end{pmatrix} - 0 + 0 \] \[ = (1-t)\left[(4-t)(6-t) - 0\right] \] \[ = (1-t)(4-t)(6-t) \]

which is a cubic polynomial with roots \(t = 1\), \(t = 4\), \(t = 6\).

How to find the eigenvectors of

\[ A = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 0 & 0 & 6 \end{pmatrix}, \]

and verify that \(1, 4, 6\) are the eigenvalues of \(A\)?

Observe that the eigenvectors must satisfy the system of linear equations

\[ Av = \lambda v \quad \text{or} \quad (A - \lambda I)v = 0 \]
\[ \begin{aligned} (1 - \lambda)v_1 + 2v_2 + 3v_3 &= 0 \\ (4 - \lambda)v_2 + 5v_3 &= 0 \\ (6 - \lambda)v_3 &= 0 \end{aligned} \]

For \(\lambda = 6\),

\[ \left. \begin{aligned} -5v_1 + 2v_2 + 3v_3 &= 0 \\ -2v_2 + 5v_3 &= 0 \\ 0 &= 0 \end{aligned} \right\} \text{WLOG take } v_3 = 1 \]
\[ v_2 = \tfrac{5}{2} \] \[ v_1 = \tfrac{2}{5}v_2 + \tfrac{3}{5} = \tfrac{8}{5} \]

To verify,

\[ A \cdot v = \begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 0 & 0 & 6 \end{pmatrix} \begin{pmatrix} 8/5 \\ 5/2 \\ 1 \end{pmatrix} = \begin{pmatrix} 48/5 \\ 15 \\ 6 \end{pmatrix} = 6 \cdot \begin{pmatrix} 8/5 \\ 5/2 \\ 1 \end{pmatrix} = \lambda v \]

In general, for a system of linear equations in row echelon form,

\[ \begin{pmatrix} M_{11} & M_{12} & \cdots & M_{1p} \\ 0 & M_{22} & \cdots & M_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & M_{pp} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{pmatrix} \]

Suppose \(M_{ij} \neq 0\) \(\forall\, j \geq i\). Then

\[ x_p = \frac{b_p}{M_{pp}} \] \[ \vdots \] \[ x_2 = \left(b_2 - \sum_{j=3}^{p} M_{2j} x_j\right) \frac{1}{M_{22}} \] \[ x_1 = \left(b_1 - \sum_{j=2}^{p} M_{1j} x_j\right) \frac{1}{M_{11}} \]

In general, this is called back substitution

\[ x_i = \left(b_i - \sum_{j=i+1}^{p} M_{ij} x_j\right) \frac{1}{M_{ii}} \qquad \text{for } i \in \{1, \ldots, p\} \]

Eigenvalues have an important role in matrix decompositions for certain classes of matrices. In particular, consider the following fundamental result for the class of symmetric matrices

Spectral Theorem. For any symmetric matrix \(A \in \mathbb{R}^{p \times p}\) there exists an orthogonal matrix \(Q\) (i.e., \(Q'Q = QQ' = I_p\)) such that,

\[ A = QDQ', \]

where \(D\) is a diagonal matrix composed of the eigenvalues of \(A\). \(\square\)

Note also that the columns of \(Q\) form an orthonormal basis for \(\mathbb{R}^p\) consisting of eigenvectors of \(A\).

The basis part means that every vector \(v \in \mathbb{R}^p\) can be expressed as

\[ v = \sum_{i=1}^{p} Q_i a_i, \]

for some \(a_1, \ldots, a_p \in \mathbb{R}\), and that \(Q_1, \ldots, Q_p\) are linearly independent.

Recall the definition of linear independence:

Vectors \(u_1, \ldots, u_n \in \mathbb{R}^p\) are said to be linearly independent if

\[ \sum_{i=1}^{n} u_i a_i = 0 \quad \text{implies that} \quad a_1 = \cdots = a_n = 0 \]

A set of vectors are said to be linearly dependent if they are not linearly independent.

Orthonormal means \(\forall\, i, j \in \{1, \ldots, p\}\)

\[ Q_i' Q_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}. \]

Next, verify that the columns \(Q_1, \ldots, Q_p\) are eigenvectors of \(A\) corresponding to eigenvalues given by the diagonal components of \(D\). Take any \(i \in \{1, \ldots, p\}\):

\[ AQ_i = QDQ'Q_i = QD \begin{pmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{pmatrix} = Q \begin{pmatrix} 0 \\ \vdots \\ D_{ii} \\ \vdots \\ 0 \end{pmatrix} = D_{ii} \cdot Q_i \]

Thus, \(Q_i\) is an eigenvector of \(A\), by definition, and by the same definition \(D_{ii}\) is an eigenvalue.

Further properties: \(\displaystyle\text{tr}(A) = \sum_{i=1}^{p} \lambda_i\) and \(\displaystyle\det(A) = \prod_{i=1}^{p} \lambda_i\).

\[ \text{tr}(A) = \text{tr}(QDQ') = \text{tr}(Q'QD) = \text{tr}(D) = \sum_{i=1}^{p} D_{ii} \]
\[ \det(A) = \det(QDQ') = \det(Q) \cdot \det(D) \cdot \det(Q') = \det(QQ') \cdot \det(D) = 1 \cdot \prod_{i=1}^{p} D_{ii} \]

What if a matrix is not symmetric nor even square? How to decompose?

Let \(X \in \mathbb{R}^{n \times p}\). Then \(X'X \in \mathbb{R}^{p \times p}\) and \(XX' \in \mathbb{R}^{n \times n}\), and both are symmetric:

\[ (X'X)' = X'(X')' = X'X \] \[ (XX')' = (X')'\, X' = XX' \]

Accordingly, we can apply the Spectral theorem to both \(X'X\) and \(XX'\) to decompose as

\[ X'X = V\Delta_1 V' \qquad \text{and} \qquad XX' = U\Delta_2 U' \]

for matrices \(V \in \mathbb{R}^{p \times p}\) and \(U \in \mathbb{R}^{n \times n}\), both orthogonal matrices, and diagonal matrices \(\Delta_1\) and \(\Delta_2\).

Moreover, it can be shown that \(X = U\Sigma V'\), where \(\Sigma\) is a diagonal matrix with nonzero diagonal components equal to the square roots of the nonzero diagonal components of both \(\Delta_1\) and \(\Delta_2\). The diagonal components of \(\Sigma\) are called the singular values of \(X\). For symmetric matrices \(A \in \mathbb{R}^{p \times p}\) with singular values \(\sigma_1, \ldots, \sigma_p\) and eigenvalues \(\lambda_1, \ldots, \lambda_p\),

\[ \sigma_i = |\lambda_i| \quad \text{for } i \in \{1, \ldots, p\} \]

SVD and other decompositions are important for data compression and dimension reduction. Consider this for \(X \in \mathbb{R}^{n \times p}\) with \(\text{rank}(X) = r\),

\[ X = U\Sigma V' = (U_1, \ldots, U_n) \begin{pmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & 0 \\ & & & & \ddots \\ & & & & & 0 \end{pmatrix} \begin{pmatrix} V_1' \\ \vdots \\ V_p' \end{pmatrix} \]
\[ = (U_1, \ldots, U_r) \begin{pmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{pmatrix} \begin{pmatrix} V_1' \\ \vdots \\ V_r' \end{pmatrix} \quad \text{``compact SVD''} \]

where \(U_1, \ldots, U_n \in \mathbb{R}^n\) and \(V_1, \ldots, V_r \in \mathbb{R}^p\). Storing raw \(X\) requires \(n \cdot p\) pieces of information. Storing the SVD requires

\[ n \cdot r + r + p \cdot r = (n + p + 1) \cdot r \]

pieces of information. For example, suppose \(n = 100\), \(p = 30\). Then

\[ (n + p + 1) \cdot r = 131 \cdot r < 3000 = n \cdot p \quad \text{if } r < 23. \]

Furthermore, if \(\sigma_k, \ldots, \sigma_r \approx 0\), for some \(k \leq r\), then

\[ X \approx (U_1, \ldots, U_k) \begin{pmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_k \end{pmatrix} \begin{pmatrix} V_1' \\ \vdots \\ V_k' \end{pmatrix} \]

which allows for even less memory requirements.