1. In previous assignments you were asked to propose a statistical model whose population features may be appropriate to estimate from the real data set you have chosen for your class project, and to consider various estimation procedures and algorithms. To evaluate an estimation procedure and algorithm, you will want to generate synthetic data from your posited statistical model. Set the “true” parameter values to the parameter estimates from the real data set; the point of the simulation study is to determine whether you can correctly recover these “true” values from synthetic data. If so, the simulation study lends credibility to the model fitted to the real data; if not, it helps to identify and correct mistakes in the estimation procedure or algorithm code, or shortcomings in the model formulation. Using pseudocode, describe how you will generate synthetic data from your posited statistical model.
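
   As a purely illustrative sketch of the simulate-then-refit idea, the R code below assumes a hypothetical logistic regression model; your own project model will differ. The names `simulate_logistic` and `beta_true` and the chosen sample sizes are assumptions made for this example, and `beta_true` stands in for the estimates you would obtain from your real data.

   ```r
   # Hypothetical model: y_i ~ Bernoulli(plogis(x_i' beta)), with an intercept
   # column followed by standard normal covariates.
   simulate_logistic <- function(n, beta) {
     k <- length(beta) - 1                      # number of non-intercept covariates
     X <- cbind(1, matrix(rnorm(n * k), n, k))  # design matrix
     prob <- plogis(drop(X %*% beta))           # success probabilities
     y <- rbinom(n, size = 1, prob = prob)      # synthetic responses
     list(y = y, X = X)
   }

   set.seed(1)
   beta_true <- c(-0.5, 1.0)  # stand-in for the real-data parameter estimates

   # Refit the model on 200 synthetic data sets of size 500 each.
   estimates <- t(replicate(200, {
     d <- simulate_logistic(500, beta_true)
     coef(glm(d$y ~ d$X - 1, family = binomial()))
   }))

   # Average estimate across replications, to compare with beta_true.
   colMeans(estimates)
   ```

   If the averaged estimates sit close to `beta_true`, the estimation code is at least behaving sensibly for this model; a large gap would point to a bug or a modeling problem.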

  2. Let \(S\) be a nonempty subset of an inner product space \(V\). The orthogonal complement to the set \(S\) is defined as

    \[ S^\perp := \{x \in V : \langle x, y \rangle = 0 \text{ for every } y \in S\}. \]
    1. Show that \(S^\perp\) is a subspace of \(V\) for any \(S \subseteq V\).
    2. Let \(W \subseteq V\) be a finite dimensional subspace, and let \(y \in V\). Show that there exist unique vectors \(u \in W\) and \(z \in W^\perp\) such that \(y = u + z\).
    3. Let \(X \in \mathbb{R}^{n \times p}\). Verify that \(\text{col}(X)\) and \(\text{null}(X')\) are orthogonal complements.
  3. Recall that when \(X \in \mathbb{R}^{n \times p}\) does not have full column rank, \(X'X\) is singular, so \((X'X)^{-1}\), and hence \(\mathcal{P}_X := X(X'X)^{-1}X'\), is undefined. However, using the SVD of \(X\) we can still construct the orthogonal projection matrix onto \(\text{col}(X)\). Write an R function that takes as input an \(n \times p\) matrix \(X\) and returns the orthogonal projection matrix onto \(\text{col}(X)\), regardless of the column rank of \(X\).
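
     One possible sketch, not the only valid approach: if \(X = UDV'\) is the SVD and \(U_r\) holds the left singular vectors belonging to the nonzero singular values, then \(U_r U_r'\) projects orthogonally onto \(\text{col}(X)\). The function name `proj_col` and the relative tolerance `tol` used to decide which singular values count as nonzero are assumptions of this example.

     ```r
     proj_col <- function(X, tol = sqrt(.Machine$double.eps)) {
       X <- as.matrix(X)
       s <- svd(X)
       # Numerical rank: singular values above a relative threshold.
       r <- sum(s$d > tol * max(s$d))
       if (r == 0) {
         # X is (numerically) the zero matrix, so col(X) = {0}.
         return(matrix(0, nrow(X), nrow(X)))
       }
       U <- s$u[, seq_len(r), drop = FALSE]  # orthonormal basis of col(X)
       U %*% t(U)
     }

     # Rank-deficient example: the third column is twice the second.
     X <- cbind(1, c(1, 2, 3, 4), c(2, 4, 6, 8))
     P <- proj_col(X)
     ```

     A quick sanity check is that `P` is symmetric and idempotent, that `P %*% X` reproduces `X`, and that the trace of `P` equals the rank of `X`.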

  4. By hand, find an orthonormal basis for the subspace spanned by the set

    \[ \left\{ \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix},\; \begin{pmatrix} 1 \\ 0 \\ 1 \\ 1 \\ 0 \end{pmatrix},\; \begin{pmatrix} 0 \\ 1 \\ 1 \\ 0 \\ 1 \end{pmatrix} \right\}. \]
  5. Write an R function that takes as input an \(n \times p\) matrix \(X\) with full column rank and returns, via the Gram-Schmidt orthonormalization algorithm, an \(n \times p\) matrix \(Q\) with orthonormal columns such that \(\text{col}(Q) = \text{col}(X)\), along with a \(p \times p\) upper-triangular matrix \(R\) such that \(X = QR\).
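
     A minimal sketch of one way to do this is given below. It uses the modified Gram-Schmidt variant, which subtracts each projection from the running residual and tends to be more stable numerically than the classical version. The function name `gram_schmidt_qr` and the rank-check threshold `1e-12` are assumptions of this example.

     ```r
     gram_schmidt_qr <- function(X) {
       X <- as.matrix(X)
       n <- nrow(X)
       p <- ncol(X)
       Q <- matrix(0, n, p)
       R <- matrix(0, p, p)
       for (j in seq_len(p)) {
         v <- X[, j]
         # Remove the components of column j along the earlier q_i.
         for (i in seq_len(j - 1)) {
           R[i, j] <- sum(Q[, i] * v)
           v <- v - R[i, j] * Q[, i]
         }
         R[j, j] <- sqrt(sum(v^2))
         if (R[j, j] < 1e-12) {
           stop("X does not appear to have full column rank")
         }
         Q[, j] <- v / R[j, j]  # normalize the residual
       }
       list(Q = Q, R = R)
     }

     set.seed(1)
     X <- matrix(rnorm(20), 5, 4)
     out <- gram_schmidt_qr(X)
     ```

     Checking that `out$Q %*% out$R` reproduces `X` and that `crossprod(out$Q)` is the identity confirms the factorization up to rounding error.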

    1. Write an R function that takes as input \((y, X)\), where \(y\) is an \(n\)-dimensional vector and \(X\) is an \(n \times p\) matrix, and returns the least squares coefficient estimates by solving the normal equations

      \[ X'Xb = X'y \]

      using the QR decomposition of \(X\).
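
       To see why the QR decomposition helps: substituting \(X = QR\) with \(Q'Q = I\) into the normal equations gives \(R'Rb = R'Q'y\), and since \(R\) is invertible when \(X\) has full column rank, this reduces to the triangular system \(Rb = Q'y\). The sketch below assumes full column rank and uses base R's `qr()` (a Householder-based factorization) instead of a hand-written Gram-Schmidt routine; the function name `ls_qr` is an assumption of this example.

       ```r
       ls_qr <- function(y, X) {
         X <- as.matrix(X)
         decomp <- qr(X)
         if (decomp$rank < ncol(X)) {
           stop("X does not appear to have full column rank")
         }
         Q <- qr.Q(decomp)  # n x p, orthonormal columns
         R <- qr.R(decomp)  # p x p, upper triangular
         # Solve R b = Q'y by back substitution.
         drop(backsolve(R, crossprod(Q, y)))
       }

       set.seed(2)
       X <- cbind(1, rnorm(50))
       y <- drop(X %*% c(1, 2)) + rnorm(50)
       b_hat <- ls_qr(y, X)
       ```

       Comparing `b_hat` with `lm.fit(X, y)$coefficients` is a convenient cross-check.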

    2. Generate synthetic regression data for various choices of \(n\) and \(p\) to test whether your least squares estimation procedure in part (a) works. Compare your least squares solution \(\hat{b}\) to the “true” coefficient values \(b\) that you used to generate the data, and show that \(\|\hat{b} - b\|_2 < 10^{-4}\) for sufficiently large values of \(n\).
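
       A hedged sketch of such a check follows. How large \(n\) must be depends on the noise level: with standard normal covariates the error behaves roughly like \(\sigma\sqrt{p/n}\), so this example assumes a small noise standard deviation (`sigma = 1e-3`) to reach the threshold at moderate \(n\). The coefficient vector `b_true`, the helper `coef_error`, and the use of `qr.solve()` as a stand-in for your own least squares function are all assumptions of this example.

       ```r
       set.seed(42)
       b_true <- c(1, -2, 0.5)  # "true" coefficients: intercept plus two slopes
       sigma <- 1e-3            # assumed noise standard deviation

       # Simulate one data set of size n and return ||b_hat - b_true||_2.
       coef_error <- function(n) {
         X <- cbind(1, matrix(rnorm(n * 2), n, 2))
         y <- drop(X %*% b_true) + rnorm(n, sd = sigma)
         b_hat <- qr.solve(X, y)  # least squares via QR; swap in your own function
         sqrt(sum((b_hat - b_true)^2))
       }

       ns <- c(100, 1000, 10000)
       errs <- sapply(ns, coef_error)
       data.frame(n = ns, error = errs)
       ```

       With a larger `sigma`, the same code would need a much larger `n` before the error drops below \(10^{-4}\).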

    3. Plot the regression line using the coefficients from part (a), for synthetic simple linear regression data (i.e., \(p = 2\) and \(b_0\) is an intercept).