1. In problem set 2 you were asked to propose a statistical model with population features that may be appropriate to estimate from the real data set you have chosen for your class project. Given the model or models that you proposed, the next step is to determine how to fit them. What estimation procedures and algorithms could you use to fit your proposed model(s)? Common options include least squares, maximum likelihood, and method of moments estimates. You do not need to fully specify an estimation procedure or algorithm; instead, search for resources to learn more about your proposed model(s) and list appropriate estimation procedures and/or algorithms, with a brief explanation of why each is appropriate.

  2. Let \(V\) be a convex subset of some vector space. Recall that a function \(f : V \to \mathbb{R}\) is said to be convex if for every \(x, y \in V\) and every \(\lambda \in [0, 1]\),

    \[ f(\lambda x + (1 - \lambda)y) \leq \lambda f(x) + (1 - \lambda)f(y). \]

    Show, directly from this definition, that the sum of squared errors function

    \[ Q(\beta) := \|Y - X\beta\|_2^2 \]

    is convex.
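
    As a hint, one possible route (a sketch, not necessarily the intended solution) uses the residual vectors \(u := Y - X\beta_1\) and \(w := Y - X\beta_2\). The symbols \(\beta_1, \beta_2, u, w\) are introduced here only for illustration. Since \(Y - X(\lambda\beta_1 + (1 - \lambda)\beta_2) = \lambda u + (1 - \lambda)w\), convexity of \(Q\) would follow from the identity

    \[ \lambda \|u\|_2^2 + (1 - \lambda)\|w\|_2^2 - \|\lambda u + (1 - \lambda)w\|_2^2 = \lambda(1 - \lambda)\|u - w\|_2^2 \geq 0, \]

    which can be checked by expanding the squared norms on the left-hand side and collecting terms.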

  3. The defining property of a projection matrix \(A\) is that \(A \cdot A = A\). Establish the following facts.

    1. If \(A\) is a projection matrix, then all of its eigenvalues are either zero or one.
    2. If \(A \in \mathbb{R}^{p \times p}\) is a symmetric projection matrix (i.e., an orthogonal projection matrix), then for every vector \(v \in \mathbb{R}^p\), the projection \(Av\) is orthogonal to \(v - Av\).
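
    Before writing the proofs, it can help to see both facts numerically. The sketch below is a sanity check, not a proof. It assumes one particular symmetric projection, \(A = Z(Z'Z)^{-1}Z'\) for a random matrix `Z`; the names `Z`, `m`, `k`, the seed, and the sizes are illustrative choices, not part of the problem.

    ```r
    # Numerical sanity check (not a proof) of the two projection facts.
    set.seed(1)                       # arbitrary seed, for reproducibility
    m <- 6; k <- 3                    # illustrative sizes (assumption)
    Z <- matrix(rnorm(m * k), m, k)   # random matrix, full column rank almost surely

    # Symmetric projection onto the column space of Z: A = Z (Z'Z)^{-1} Z'
    A <- Z %*% solve(crossprod(Z), t(Z))

    idempotent_gap <- max(abs(A %*% A - A))       # A %*% A should equal A up to rounding
    eig <- eigen(A, symmetric = TRUE)$values      # each value should be near 0 or near 1
    v <- rnorm(m)
    inner <- sum((A %*% v) * (v - A %*% v))       # <Av, v - Av> should be near 0

    print(round(eig, 8))
    print(c(idempotent_gap = idempotent_gap, inner = inner))
    ```

    Any other symmetric idempotent matrix could be substituted for `A`; the printed eigenvalues and inner product only illustrate the claims up to floating-point error.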
  4. Compute least squares estimates numerically as follows.

    1. Write an R function that takes as input \((y, X)\), where \(y\) is an \(n\)-dimensional vector and \(X\) is an \(n \times p\) matrix, and returns the least squares coefficient estimates by solving the normal equations

      \[ X'Xb = X'y \]

      with Gaussian elimination and back-substitution.

    2. Generate synthetic regression data for various choices of \(n\) and \(p\) to test whether your least squares estimation procedure from part 1 works. You should compare your least squares solution \(\hat{b}\) to the "true" coefficient values \(b\) that you used to generate the data. Show that \(\|\hat{b} - b\|_2 < 10^{-4}\) for sufficiently large values of \(n\).

    3. Plot the regression line using the coefficients from part 1, for synthetic simple linear regression data (i.e., \(p = 2\) and \(b_0\) is an intercept).

  5. Repeat parts 1, 2, and 3 of problem 4, using the singular value decomposition \(X = UDV'\) to simplify the task of solving the normal equations \(X'Xb = X'y\). Do you still need to use Gaussian elimination and back-substitution?
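
The synthetic-data check described above can be organized as a small testing harness. The following is a minimal sketch under stated assumptions: the helper names `simulate_data`, `reference_solver`, and `estimation_error`, the noise level `noise_sd = 1e-3`, the seed, and the true coefficients are all hypothetical choices. The stand-in solver uses base R's `solve()` on the normal equations purely as a placeholder; it is not the Gaussian elimination or SVD routine the problems ask you to write.

```r
# Hypothetical helper: simulate y = X b + noise, with an intercept column in X.
simulate_data <- function(n, b, noise_sd = 1e-3) {
  p <- length(b)
  X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # first column is the intercept
  y <- as.vector(X %*% b + rnorm(n, sd = noise_sd))
  list(X = X, y = y)
}

# Placeholder solver: base R's solve() applied to X'X b = X'y.
# Swap in your own solver with the same (y, X) signature.
reference_solver <- function(y, X) {
  as.vector(solve(crossprod(X), crossprod(X, y)))
}

# Hypothetical helper: Euclidean distance between estimated and true coefficients.
estimation_error <- function(n, b, solver = reference_solver) {
  d <- simulate_data(n, b)
  sqrt(sum((solver(d$y, d$X) - b)^2))
}

set.seed(42)                       # arbitrary seed, for reproducibility
b_true <- c(1, -2, 0.5)            # illustrative "true" coefficients (assumption)
sizes <- c(50, 500, 5000, 50000)   # increasing sample sizes
errors <- sapply(sizes, estimation_error, b = b_true)
print(data.frame(n = sizes, error = errors))
```

Passing a different function through the `solver` argument, for example `estimation_error(5000, b_true, solver = my_solver)` with your own `my_solver(y, X)`, reuses the same harness for each approach. How large \(n\) must be to reach the \(10^{-4}\) threshold depends on the noise level you choose.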