Magíster en Economía (Master's in Economics)
Teoría Econométrica I (Econometric Theory I)
These slides closely follow the lecture notes I prepared for OLS, which are primarily based on:
Our goal is to look “under the hood” of OLS. We will separate the algebraic requirements needed to compute estimates from the statistical assumptions required for inference.
Let the data be denoted by \(\{w_i\}_{i=1}^N\), where \[ w_i=(y_i,x_i), \qquad y_i\in\mathbb{R}, \qquad x_i\in\mathbb{R}^k. \]
The joint density can always be factored into a conditional and a marginal component: \[ f(y_i,x_i;\theta)=f(y_i\mid x_i;\theta_1)\cdot f(x_i;\theta_2). \]
When we talk about regression, our main object of interest is the conditional component, \[ f(y_i\mid x_i;\theta_1). \]
That is, regression seeks to characterize how the distribution of the outcome \(y\) systematically varies with the covariates \(x\).
Rather than modeling the full conditional density, we can focus on one of its moments.
The natural starting point is the conditional mean. In particular, the Conditional Expectation Function (CEF) is defined as \[ m(x_i)\equiv \mathbb{E}[y_i\mid x_i]. \]
This object plays a central role in econometrics because it summarizes how the average value of the outcome changes with the covariates.
Much of econometrics can be viewed as (i) estimating the CEF, or some feature of it, and (ii) asking what additional structure is needed to give that object a causal interpretation.
Now define the regression error as the deviation from the conditional mean: \[ u_i\equiv y_i-m(x_i) \qquad \Longrightarrow \qquad y_i=m(x_i)+u_i. \]
Important: This decomposition is a definition, not an assumption. By construction, \(\mathbb{E}[u_i\mid x_i]=0\).
A leading special case arises when the CEF is linear: \(m(x_i)=x_i'\beta\).
This gives the Linear Regression Model: \[ y_i=x_i'\beta+u_i, \qquad \mathbb{E}[u_i\mid x_i]=0. \]
Because \(\mathbb{E}[u_i\mid x_i]=0\), the Law of Iterated Expectations implies \[ \mathbb{E}[u_i]=0. \]
More generally, we obtain unconditional orthogonality: \[ \mathbb{E}[h(x_i)u_i]=0 \qquad \text{for any measurable } h(\cdot), \] and in particular, \(\mathbb{E}[x_i u_i]=0\).
This orthogonality condition is the foundation of moment-based estimation.
If orthogonality fails, for example because of endogeneity arising from omitted variables, simultaneity, or measurement error, OLS will typically converge to the wrong object. Other estimators, such as IV and GMM, are designed for this case.
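The orthogonality conditions above can be illustrated by simulation: when the error is generated independently of the regressor (so \(\mathbb{E}[u\mid x]=0\) holds by construction), the sample analogues of \(\mathbb{E}[u]\) and \(\mathbb{E}[x u]\) are close to zero by the Law of Large Numbers. A minimal NumPy sketch on simulated data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
x = rng.normal(size=N)
u = rng.normal(size=N)          # independent of x, so E[u | x] = 0
y = 1.0 + 2.0 * x + u

# Sample analogues of the orthogonality conditions E[u] = 0 and E[x*u] = 0
print(abs(u.mean()))        # close to 0 by the LLN
print(abs((x * u).mean()))  # close to 0
```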
Statistical Models. Up to this point, everything has been a statistical statement. The decomposition \(y_i=m(x_i)+u_i\) describes conditional averages in the observable data, but it does not by itself have causal content.
Structural Models are motivated by economic theory and are intended to represent behavioral, technological, or institutional relationships. This is where identification becomes central.
A natural and highly tractable way to model the Conditional Expectation Function is to approximate it with a linear function. This leads to the Linear Regression Model.
Assumption 1 — Linearity
\[y_i = x_i'\beta + u_i,\qquad i=1,\ldots,N\]
In matrix form, stacking all \(N\) observations: \[y = X\beta + u\]
where \(y \in \mathbb{R}^{N\times 1}\), \(X \in \mathbb{R}^{N\times k}\), \(\beta \in \mathbb{R}^{k\times 1}\), \(u \in \mathbb{R}^{N\times 1}\).
This linear representation is the starting point for the algebra of OLS and for the finite-sample and asymptotic results that follow later.
The coefficient \(\beta_j\) is usually interpreted as a ceteris paribus effect.
| Model | Dependent variable | Regressor | Interpretation |
|---|---|---|---|
| Level–level | \(y\) | \(x\) | \(\Delta y = \beta_j\,\Delta x\) |
| Log–level | \(\log(y)\) | \(x\) | \(\%\Delta y \approx 100\beta_j\,\Delta x\) |
| Log–log | \(\log(y)\) | \(\log(x)\) | \(\%\Delta y = \beta_j\,\%\Delta x\) |
For observation \(i\), the model error is \(u_i = y_i - x_i'\beta\). After estimation, the sample analog, \(\hat u_i = y_i - x_i'\hat\beta\), is called the residual.
To estimate \(\beta\), Ordinary Least Squares (OLS) chooses the parameter vector that minimizes the sum of squared residuals: \[ \hat{\beta} = \arg\min_{\beta} S(\beta), \qquad S(\beta) = (y-X\beta)'(y-X\beta). \]
The logic is straightforward: squaring the residuals penalizes large errors and treats positive and negative deviations symmetrically.
Expanding the objective function gives \(S(\beta)=y'y-2\beta'X'y+\beta'X'X\beta\).
Taking derivatives with respect to \(\beta\) and setting them equal to zero yields the first-order condition: \[ \left.\frac{\partial S(\beta)}{\partial\beta}\right|_{\beta=\hat\beta} = -2X'y+2X'X\hat\beta=0. \]
This first-order condition is the key step that leads to the normal equations.
Rearranging the first-order condition gives the fundamental normal equations: \[X'X\hat\beta = X'y\]
But, to solve this system for a unique \(\hat\beta\), we need an invertibility condition.
Assumption 2 — Full Rank
The regressor matrix \(X \in \mathbb{R}^{N\times k}\) satisfies \(\operatorname{rank}(X)=k\). Equivalently, there is no exact multicollinearity among the regressors.
Under Assumption 2, the matrix \(X'X\) is symmetric positive definite and therefore invertible. Hence, \[ \boxed{\hat\beta_{\text{OLS}}=(X'X)^{-1}X'y.} \]
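As a sanity check, the closed-form solution can be computed directly and compared against a library least-squares routine. A minimal NumPy sketch on simulated data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=N)

# Closed-form OLS: beta_hat solves the normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```

Using `np.linalg.solve` on the normal equations avoids forming the explicit inverse, which is numerically preferable to computing \((X'X)^{-1}\) directly.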
Two immediate implications are worth keeping in mind. (i) The second-order condition is \(\frac{\partial^2 S(\beta)}{\partial\beta\,\partial\beta'}=2X'X \succ 0\), so the OLS solution is the unique global minimum. (ii) The normal equations also imply \(X'\hat u = X'(y-X\hat\beta)=0\), which means that the residual vector is orthogonal to every column of the regressor matrix.
Once the OLS estimator has been obtained, the fitted values and residuals admit a useful geometric interpretation. The fitted values (\(\hat{y}= X\hat\beta = Py\)) and residuals (\(\hat{u}=y-\hat{y} = My\)) can be written compactly using two matrices: the projection matrix \(P = X(X'X)^{-1}X'\) and the annihilator (residual-maker) matrix \(M = I_N - P\).
So, OLS splits the observed outcome vector into two parts: \(y=\hat y+\hat u\).
The matrices \(P\) and \(M\) summarize the key algebraic structure of OLS.
Main properties: \[ P'=P, \qquad M'=M \] \[ P^2=P, \qquad M^2=M \] \[ PX=X, \qquad MX=0 \] \[ PM=MP=0 \]
These results tell us that both matrices are symmetric and idempotent, and that they project onto orthogonal subspaces.
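All of these algebraic properties can be verified numerically. A minimal NumPy sketch (simulated regressor matrix; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
P = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto col(X)
M = np.eye(50) - P                      # annihilator (residual-maker)

checks = [
    np.allclose(P, P.T), np.allclose(M, M.T),      # symmetric
    np.allclose(P @ P, P), np.allclose(M @ M, M),  # idempotent
    np.allclose(P @ X, X), np.allclose(M @ X, 0),  # PX = X, MX = 0
    np.allclose(P @ M, 0),                         # orthogonal subspaces
]
print(all(checks))  # True
```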
The geometric meaning of OLS: it chooses the point in \(\operatorname{col}(X)\) that is closest to the observed vector \(y\).
In the figure, the point \(\hat y=X\hat\beta\) is the orthogonal projection of \(y\) onto the regressor space, and the residual vector \(\hat u\) is the gap between the observed outcome vector and its projection.

OLS minimizes \(\|\hat{u}\|^2 = \hat{u}'\hat{u}\) — the squared distance from \(y\) to \(\operatorname{col}(X)\).
The projection machinery developed above leads directly to one of the most useful algebraic results in linear regression: the Frisch–Waugh–Lovell (FWL) theorem.
FWL formalizes the concept of partialling out.
To do this, we first define the residual-maker matrix for \(X_1\): \[M_1 = I_N - X_1(X_1'X_1)^{-1}X_1'\]
Theorem — Frisch–Waugh–Lovell (FWL)
Partition \(X = [X_1 \; X_2]\). The OLS estimate of \(\beta_2\) from the full regression of \(y\) on \([X_1 \; X_2]\) equals \[\hat\beta_2 = (\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'\tilde{y},\] where \(\tilde{X}_2 = M_1 X_2\) and \(\tilde{y} = M_1 y\) are the residualized variables.
Moreover, the residuals from the full and auxiliary regressions are identical.
This theorem says that the coefficient on \(X_2\) can be obtained either from the full regression or from a regression in which the linear influence of \(X_1\) has first been removed from both \(y\) and \(X_2\).
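The FWL equivalence can be checked numerically: the coefficient on \(X_2\) from the full regression matches the coefficient from the residualized regression. A minimal NumPy sketch (simulated data; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 300
X1 = np.column_stack([np.ones(N), rng.normal(size=N)])  # controls (incl. intercept)
X2 = rng.normal(size=(N, 1))                            # regressor of interest
y = X1 @ np.array([1.0, 0.5]) + 2.0 * X2[:, 0] + rng.normal(size=N)

# Full regression of y on [X1 X2]
X = np.hstack([X1, X2])
b_full = np.linalg.solve(X.T @ X, X.T @ y)

# Partialled-out regression: residualize y and X2 on X1, then regress
M1 = np.eye(N) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
X2t, yt = M1 @ X2, M1 @ y
b_fwl = np.linalg.solve(X2t.T @ X2t, X2t.T @ yt)

print(np.allclose(b_full[-1], b_fwl[0]))  # True: same coefficient on X2
```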
The projection results developed above also give rise to one of the most familiar summaries of regression fit.
Since \(y=\hat y+\hat u\) and \(\hat y'\hat u=0\), it follows (letting \(\iota\) denote the \(N\)-vector of ones, \(M_{\iota}=I_N-\iota(\iota'\iota)^{-1}\iota'\) the demeaning matrix, and assuming the regression includes an intercept) that \[ \underbrace{y'M_{\iota}y}_{\text{TSS}} = \underbrace{\hat y'M_{\iota}\hat y}_{\text{ESS}} + \underbrace{\hat u'\hat u}_{\text{SSR}}. \]
This is the basic decomposition of variation in linear regression.
The key point is that OLS decomposes the observed variation in \(y\) into a part captured by the linear model and a part left in the residuals.
- TSS \(=\sum_i (y_i-\bar y)^2\): the total variation in the dependent variable, with \(N-1\) degrees of freedom.
- ESS \(=\sum_i (\hat y_i-\bar y)^2\): the variation accounted for by the regressors, with \(k-1\) degrees of freedom when an intercept is included.
- SSR \(=\sum_i \hat u_i^2\): the unexplained (residual) variation, with \(N-k\) degrees of freedom.
This decomposition motivates the coefficient of determination, \[ R^2 = 1-\frac{\text{SSR}}{\text{TSS}} = \frac{\text{ESS}}{\text{TSS}} \in [0,1]. \]
Thus, \(R^2\) measures the fraction of the sample variation in \(y\) that is accounted for by the linear regression model.
To account for the loss of degrees of freedom when more regressors are added: \[ \text{Adjusted }R^2 = \bar R^2 = 1-(1-R^2)\frac{N-1}{N-k}. \]
Unlike \(R^2\), which never decreases when a regressor is added, \(\bar R^2\) may fall if the new regressor is uninformative. For this reason, adjusted \(R^2\) is often more informative than \(R^2\) when comparing models with different numbers of regressors.
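The decomposition and the two equivalent definitions of \(R^2\) can be verified numerically. A minimal NumPy sketch (simulated data with an intercept; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
N, k = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
u_hat = y - y_hat

TSS = np.sum((y - y.mean()) ** 2)
ESS = np.sum((y_hat - y.mean()) ** 2)
SSR = np.sum(u_hat ** 2)

R2 = 1 - SSR / TSS
R2_adj = 1 - (1 - R2) * (N - 1) / (N - k)
print(np.isclose(R2, ESS / TSS))  # True: the two definitions coincide (intercept included)
```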
Having established the algebraic mechanics of OLS, we now evaluate its statistical properties.
Is \(\hat\beta\) a reliable estimator of the true population parameter \(\beta\)? We begin by analyzing its finite-sample bias.
Recall the OLS estimator: \(\hat\beta=(X'X)^{-1}X'y\). Substituting the true data-generating process, \(y=X\beta+u\), into this expression yields: \[ \hat\beta = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u \]
Taking expectations conditional on the regressor matrix \(X\), and treating \(X\) as fixed given itself, we obtain: \[ \mathbb{E}[\hat\beta\mid X] = \beta + (X'X)^{-1}X'\mathbb{E}[u\mid X] \]
For OLS to be completely unbiased, the second term in the previous equation, \(\mathbb{E}[u\mid X]\), must vanish.
Assumption 3 — Zero Conditional Mean
\[ \mathbb{E}[u\mid X]=0 \]
That is, conditional on the full regressor matrix, the unobserved error term has a mean of exactly zero.
Note: This is arguably the most critical assumption in microeconometrics/applied econometrics (your next course). This assumption states that the regressors carry no systematic information about the unobserved shocks. So, if this assumption fails (whether due to omitted variables, measurement error, or simultaneous equations), we enter the domain of endogeneity. In this case, the second term in our derivation no longer evaluates to zero, making OLS fundamentally biased. Resolving this requires alternative identification strategies, such as Instrumental Variables.
Under Assumptions 1, 2, and 3, OLS is conditionally unbiased: \[ \mathbb{E}[\hat\beta\mid X]=\beta. \]
This is a reassuring repeated-sampling property: across many hypothetical samples, the OLS estimator is centered at the true population parameter.
Unbiasedness alone tells us nothing about precision. An estimator can be unbiased but highly dispersed, making any single estimate unreliable. Therefore, the next crucial object of interest is the variance-covariance matrix of OLS.
By definition, the conditional variance of the estimator is the expected outer product of its deviation from the mean, \(\operatorname{Var}(\hat\beta\mid X) = \mathbb{E}\bigl[(\hat\beta - \beta)(\hat\beta - \beta)' \mid X\bigr]\).
Recall from our previous derivation that the estimation error is \(\hat\beta-\beta=(X'X)^{-1}X'u\). Also, we are conditioning on \(X\), so the matrices involving \(X\) are treated as non-stochastic constants. Therefore,
\[ \operatorname{Var}(\hat\beta\mid X) = (X'X)^{-1}X'\,\mathbb{E}[uu'\mid X]\,X(X'X)^{-1} \]
This is known as the “sandwich” form: the “bread”, \((X'X)^{-1}X'\), is driven entirely by the observed regressors, and the “meat”, \(\mathbb{E}[uu'\mid X]\), is the \(N \times N\) covariance matrix of the unobserved errors.
To obtain the classical, mathematically tractable textbook expression (and to prove the Gauss-Markov theorem), we must introduce an additional, highly restrictive assumption about the “meat”.
Assumption 4 — Spherical Disturbances
\[ \mathbb{E}[uu'\mid X]=\sigma^2 I_N. \]
Equivalently, conditional on \(X\), the disturbances are homoskedastic and mutually uncorrelated.
Under this assumption, the general sandwich formula collapses to \[ \operatorname{Var}(\hat\beta\mid X)=\sigma^2(X'X)^{-1}. \]
Under Assumptions 1–4, we obtain the exact finite-sample variance: \[\operatorname{Var}(\hat\beta\mid X)=\sigma^2(X'X)^{-1}\]
In empirical practice, however, the spherical-disturbance assumption (A4) frequently fails. When it does, we rely on robust “sandwich” estimators:
White’s heteroskedasticity-robust estimator: \(\widehat{\operatorname{Var}}_{rob}(\hat\beta\mid X) = (X'X)^{-1} \left( \sum_{i=1}^N \hat u_i^2 x_i x_i' \right) (X'X)^{-1}\)
Cluster-robust estimator (for within-group dependence): \(\widehat{\operatorname{Var}}_{CR}(\hat\beta\mid X) = (X'X)^{-1} \left( \sum_{g=1}^G X_g'\hat u_g\hat u_g'X_g \right) (X'X)^{-1}\)
Notes: Unlike the classical formula, these robust estimators are not exact in finite samples. They are justified entirely by large-sample asymptotic theory (which we will cover later).
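The contrast between the classical and robust formulas can be seen on simulated heteroskedastic data. A minimal NumPy sketch of White's HC0 estimator, with no finite-sample correction (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, k = 1000, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
u = rng.normal(size=N) * np.exp(0.5 * X[:, 1])   # heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + u

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat

# Classical variance: sigma^2 (X'X)^{-1} with sigma^2 estimated by SSR/(N-k)
V_classical = (u_hat @ u_hat / (N - k)) * XtX_inv

# White (HC0) sandwich: (X'X)^{-1} (sum_i u_i^2 x_i x_i') (X'X)^{-1}
meat = (X * u_hat[:, None] ** 2).T @ X
V_robust = XtX_inv @ meat @ XtX_inv

print(np.sqrt(np.diag(V_classical)))  # classical standard errors
print(np.sqrt(np.diag(V_robust)))    # robust standard errors: differ here
```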
To make the classical variance formula feasible, we need an estimator for the unknown disturbance variance \(\sigma^2\).
The standard choice is based on the residual sum of squares: \[ \tilde\sigma^2=\frac{\hat u'\hat u}{N-k} \] Dividing by the degrees of freedom \(N-k\), rather than by \(N\), makes this estimator unbiased: \(\mathbb{E}[\tilde\sigma^2\mid X]=\sigma^2\) under Assumptions 1–4.
Theorem — Gauss–Markov (BLUE)
Under Assumptions 1–4, the OLS estimator is BLUE: the Best Linear Unbiased Estimator.
That is, for any other linear unbiased estimator \(\tilde\beta=Cy\) (where \(C\) is a matrix function of \(X\)), the difference in their variance-covariance matrices is positive semi-definite: \[ \operatorname{Var}(\tilde\beta\mid X)-\operatorname{Var}(\hat\beta\mid X)\succeq 0 \]
This guarantees that OLS is not just unbiased—it is the most efficient estimator within its class, providing the tightest possible sampling distribution.
Important: The theorem does not state that OLS is optimal among all possible estimators. It claims optimality only within the restricted class of estimators that are both linear in \(y\) and unbiased. If a researcher is willing to accept a small amount of bias (e.g., Ridge regression) or use nonlinear procedures, they may achieve a strictly lower Mean Squared Error (MSE).
(Please refer to the Lecture Notes for the complete proof).
Up to this point, we have derived the OLS estimator unconditionally, allowing the data to freely dictate the parameter estimates.
In many empirical applications, however, economic theory suggests that the parameters must satisfy exact linear relationships. Common examples include:
Suppose that our parameter vector \(\beta\) is subject to \(q\) linear restrictions: \[ Q'\beta=c, \qquad Q\in\mathbb{R}^{k\times q} \text{ with full column rank}, \qquad c\in\mathbb{R}^q \]
We now want to minimize the Sum of Squared Residuals (SSR) strictly over the subset of parameter values that satisfy these theoretical restrictions.
\[ \hat\beta_{CLS} = \arg\min_\beta S(\beta) \qquad \text{subject to} \qquad Q'\beta=c \]
To solve this constrained optimization problem, we set up a Lagrangian: \[ \mathcal{L}(\beta,\lambda) = (y-X\beta)'(y-X\beta) + 2\lambda'(Q'\beta-c) \] where \(\lambda\in\mathbb{R}^q\) is the vector of shadow prices (Lagrange multipliers) associated with the constraints.
Solving the first-order conditions with respect to both \(\beta\) and \(\lambda\) yields the closed-form CLS estimator: \[ \boxed{ \hat\beta_{CLS} = \hat\beta_{OLS} - (X'X)^{-1}Q\bigl[Q'(X'X)^{-1}Q\bigr]^{-1}(Q'\hat\beta_{OLS} - c) } \]
Note: The formula is often written with a plus sign by reversing the final term to \((c - Q'\hat\beta_{OLS})\).
The algebraic intuition is: it starts with the unrestricted \(\hat\beta_{OLS}\) and adjusts it by the exact minimum distance required to force the restrictions \(Q'\beta=c\) to hold.
If the unrestricted estimator happens to already satisfy the rule (\(Q'\hat\beta_{OLS}=c\)), the adjustment term evaluates to zero and \(\hat\beta_{CLS}=\hat\beta_{OLS}\).
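The CLS formula can be checked numerically by imposing a hypothetical restriction, here \(\beta_1+\beta_2=1\), and verifying that it holds exactly in the adjusted estimate. A minimal NumPy sketch (simulated data; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
N, k = 400, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 0.6, 0.4]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
b_ols = XtX_inv @ X.T @ y

# One restriction: beta_1 + beta_2 = 1, i.e. Q'beta = c with Q = (0, 1, 1)', c = 1
Q = np.array([[0.0], [1.0], [1.0]])
c = np.array([1.0])

# CLS adjustment term from the closed-form formula
adj = XtX_inv @ Q @ np.linalg.solve(Q.T @ XtX_inv @ Q, Q.T @ b_ols - c)
b_cls = b_ols - adj

print(np.allclose(Q.T @ b_cls, c))  # True: the restriction holds exactly
```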
The statistical properties depend entirely on whether the imposed theory is true:
Because imposing false restrictions damages the consistency of our estimator, we must formally test whether the data supports the theory. This leads us directly to the \(F\)-test.
Up to this point, the Gauss–Markov theorem guaranteed that OLS is BLUE relying only on assumptions about the conditional mean and variance.
However, obtaining a point estimate is not enough. We are also interested in inference. To conduct hypothesis tests and compute exact tail probabilities in finite samples, we must fully specify the data-generating process.
To conduct exact finite-sample inference, we add a distributional assumption to the classical linear model:
Assumption 5 — Normality
\[ u \mid X \sim \mathcal{N}(0,\sigma^2 I_N) \]
The normal linear regression model is theoretically elegant because the linear algebra of OLS maps directly into standard probability distributions.
Under Assumptions 1–5:
(i) \[ \hat\beta \mid X \sim \mathcal{N}\!\bigl(\beta,\;\sigma^2(X'X)^{-1}\bigr) \]
(ii) \[ \frac{\hat u'\hat u}{\sigma^2} \sim \chi^2(N-k) \]
(iii) \(\hat\beta\) and \(\hat u'\hat u\) are independent conditional on \(X\).
Why are these results true?
(i) Under normality, the estimator \(\hat\beta=\beta+(X'X)^{-1}X'u\) is strictly a linear transformation of a normal vector, which is always normally distributed.
(ii) The residual sum of squares can be written as a quadratic form in the idempotent matrix \(M\): \(\hat u'\hat u=u'Mu\). Because \(M\) has rank \(N-k\), this maps to the sum of \(N-k\) squared independent standard normal variables.
(iii) The estimator depends on \(u\) through \((X'X)^{-1}X'u\), while the residuals depend on \(u\) through \(Mu\). Since \(X'M=0\), these components are orthogonal; under joint normality, zero covariance (orthogonality) guarantees strict independence.
Homework: Prove (ii) and (iii) rigorously using the spectral decomposition of the idempotent matrix \(M\).
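Result (ii) can be illustrated by Monte Carlo: for \(u \sim \mathcal{N}(0, I_N)\), the quadratic form \(u'Mu\) should average to the mean of a \(\chi^2(N-k)\) variable, namely \(N-k\). A minimal NumPy sketch (simulated data; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)
N, k, reps = 30, 3, 5000
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
M = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)  # rank N - k

# Monte Carlo: u'Mu with u ~ N(0, I_N) has a chi^2(N - k) distribution
draws = np.array([u @ M @ u for u in rng.normal(size=(reps, N))])
print(abs(draws.mean() - (N - k)) < 0.5)  # True: sample mean is near N - k
```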
Suppose we want to test a null hypothesis on a single coefficient: \[ H_0:\beta_j=\beta_{j,0} \]
From the corollary above, we know the exact distribution of that specific coefficient: \[ \hat\beta_j \mid X \sim \mathcal{N}\!\bigl(\beta_j,\;\sigma^2[(X'X)^{-1}]_{jj}\bigr) \]
Therefore, the feasible test statistic is: \[ t_j=\frac{\hat\beta_j-\beta_{j,0}}{\sqrt{\tilde\sigma^2[(X'X)^{-1}]_{jj}}} \sim t_{N-k} \qquad \text{under } H_0 \]
The decision rule is to reject \(H_0\) at significance level \(\alpha\) if \(|t_j| > t_{N-k,\alpha/2}\).
A particularly common application is testing \(H_0:\beta_j=0\), which determines whether regressor \(j\) has a statistically discernible effect distinct from zero.
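Computing the feasible \(t\) statistic requires only the OLS estimate, the residual variance estimator, and the relevant diagonal entry of \((X'X)^{-1}\). A minimal NumPy sketch testing \(H_0:\beta_1=0\) on simulated data where the true coefficient is nonzero (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
N, k = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
u_hat = y - X @ b
s2 = u_hat @ u_hat / (N - k)           # sigma^2 estimator with N - k dof

# t statistic for H0: beta_1 = 0
se1 = np.sqrt(s2 * XtX_inv[1, 1])
t1 = b[1] / se1
print(t1)  # large in absolute value, since the true beta_1 = 0.5 is nonzero
```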
Now consider testing multiple restrictions simultaneously with a joint null hypothesis: \[ H_0:Q'\beta = c \] where \(Q\) is a \(k\times q\) matrix of rank \(q\) (matching our setup from Constrained Least Squares).
The \(F\)-test asks whether imposing these \(q\) linear restrictions makes the model fit the data substantially worse. The classical \(F\)-statistic compares the restricted and unrestricted Sum of Squared Residuals (SSR): \[ F = \frac{(\text{SSR}_R-\text{SSR}_U)/q}{\text{SSR}_U/(N-k)} \sim F_{q,N-k} \qquad \text{under } H_0 \]
An equivalent formulation uses only the unrestricted estimates. This is the Wald form: \[ F_W = \frac{(Q'\hat\beta - c)'\bigl[Q'(X'X)^{-1}Q\bigr]^{-1}(Q'\hat\beta - c)}{\tilde\sigma^2\,q} = F \]
This statistic is numerically identical to the SSR-based \(F\)-statistic.
The interpretation is intuitive: it measures how far the unrestricted estimates \(Q'\hat\beta\) are from the hypothesized values \(c\), weighted by the precision (variance-covariance) of those estimates.
Under Assumption 5 (Normality), we know that exactly: \[ F_W \sim F_{q,N-k} \]
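The numerical equivalence of the SSR form and the Wald form can be verified directly, using the CLS formula to obtain the restricted fit. A minimal NumPy sketch with one hypothetical restriction, \(\beta_1+\beta_2=1\) (simulated data; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
N, k, q = 400, 3, 1
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 0.6, 0.4]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
u_hat = y - X @ b
SSR_U = u_hat @ u_hat
s2 = SSR_U / (N - k)

# Restriction: beta_1 + beta_2 = 1
Q = np.array([[0.0], [1.0], [1.0]]); c = np.array([1.0])
d = Q.T @ b - c

# Wald form of the F statistic
F_wald = (d @ np.linalg.solve(Q.T @ XtX_inv @ Q, d)) / (s2 * q)

# SSR form: restricted fit via the CLS formula
b_r = b - XtX_inv @ Q @ np.linalg.solve(Q.T @ XtX_inv @ Q, d)
SSR_R = np.sum((y - X @ b_r) ** 2)
F_ssr = ((SSR_R - SSR_U) / q) / (SSR_U / (N - k))

print(np.isclose(F_wald, F_ssr))  # True: the two forms are numerically identical
```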
It is theoretically useful to define the corresponding Wald statistic (without dividing by \(q\)): \[ W = q\cdot F_W \]
In large samples, even if the normality assumption fails, the Central Limit Theorem ensures that: \[ W \xrightarrow{d} \chi^2(q) \]
This is why the Wald principle extends so naturally to asymptotic inference later in the course.
The previous slides presented hypothesis testing in its critical-value form.
The \(p\)-value expresses the exact same logic from a different angle.
Observed significance level: The \(p\)-value is the probability of obtaining a test statistic at least as extreme as the one actually observed in the sample: \[ p\text{-value}=\Pr_{H_0}(\text{statistic at least as extreme as observed}) \]
For the two-sided \(t\)-test, this becomes: \[ p\text{-value} = 2\Pr(t_{N-k}\ge |t_{\text{obs}}|) = 2\bigl[1-F_{t_{N-k}}(|t_{\text{obs}}|)\bigr] \]

Equivalently, the decision rule simplifies to: \[ \text{Reject } H_0 \text{ at level } \alpha \qquad \Longleftrightarrow \qquad p\text{-value} < \alpha \]
Under Assumption 5 (Normality), the conditional density of \(y_i\) given \(x_i\) is strictly normal.
We will formally study Maximum Likelihood Estimation (MLE) later in the course. But, by now, we can take advantage of our current assumption.
Let’s introduce a fundamental mathematical object: the conditional log-likelihood function for the entire sample of \(N\) independent observations.
Because the observations are independent, the joint log-likelihood is simply the sum of the individual log-densities. In matrix notation, this is: \[ \ell_N(\beta,\sigma^2 \mid y, X) = -\frac{N}{2}\ln(2\pi) - \frac{N}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) \]
Notice that the final term contains the exact OLS objective function, \(S(\beta) = (y - X\beta)'(y - X\beta)\).
We can elegantly rewrite the log-likelihood as: \[ \ell_N(\beta,\sigma^2 \mid y, X) = \text{const} - \frac{1}{2\sigma^2}S(\beta) \]
Because the variance \(\sigma^2\) is strictly positive, maximizing this log-likelihood with respect to \(\beta\) is mathematically identical to minimizing the sum of squared residuals \(S(\beta)\). Therefore, under normality, OLS and MLE are exactly the same estimator for \(\beta\).
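The equivalence can be illustrated numerically: evaluating the Gaussian log-likelihood at the OLS estimate and at any perturbed parameter vector, the OLS point always attains a weakly higher likelihood. A minimal NumPy sketch (simulated data; names and the perturbation are illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)

def loglik(beta, sigma2):
    # Gaussian log-likelihood: const - (N/2) ln(sigma^2) - S(beta) / (2 sigma^2)
    r = y - X @ beta
    return -N / 2 * np.log(2 * np.pi) - N / 2 * np.log(sigma2) - r @ r / (2 * sigma2)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
s2 = np.mean((y - X @ b_ols) ** 2)   # the MLE of sigma^2 divides by N, not N - k

# Any perturbation of beta lowers the Gaussian log-likelihood
print(loglik(b_ols, s2) >= loglik(b_ols + np.array([0.1, -0.1]), s2))  # True
```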
If we incorrectly assume normality (Assumption 5 fails) and maximize the Gaussian likelihood function anyway, this is known as Gaussian Quasi-Maximum Likelihood Estimation (QMLE).
Because the first-order conditions for \(\beta\) in the Gaussian likelihood depend only on the linear residuals, maximizing this “wrong” likelihood still yields exactly the OLS estimator for \(\beta\).
This is a profound theoretical result:
Given a newly observed covariate vector \(x_{N+j}\), the natural out-of-sample predictor simply replaces the unknown parameters with our OLS estimates: \[ \hat{y}_{N+j} = x_{N+j}'\hat\beta \]
When forecasting, we must clearly distinguish between two different prediction targets, each with different sources of uncertainty: the conditional mean \(\mathbb{E}[y_{N+j}\mid x_{N+j}] = x_{N+j}'\beta\), whose uncertainty comes only from estimating \(\beta\), and the realized outcome \(y_{N+j} = x_{N+j}'\beta + u_{N+j}\), which adds the irreducible future shock \(u_{N+j}\).
To quantify the precision of our prediction for the realized outcome (under the assumption of homoskedasticity), we compute the MSFE:
\[ \mathbb{E}\bigl[(\hat{y}_{N+j} - y_{N+j})^2 \mid x_{N+j}\bigr] = \sigma^2 \bigl[ 1 + x_{N+j}'(X'X)^{-1}x_{N+j} \bigr] \]
This decomposes the forecast variance into two parts: (i) \(\sigma^2\), the irreducible fundamental uncertainty from the future shock \(u_{N+j}\) (no matter how much data we have, this term remains); and (ii) \(\sigma^2\, x_{N+j}'(X'X)^{-1}x_{N+j}\), the estimation uncertainty.
Notice that the estimation uncertainty depends on a quadratic form of \(x_{N+j}\). This means that predicting outcomes for covariate profiles that are very far from the historical sample average will mechanically result in much larger forecast standard errors.
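This extrapolation effect can be seen directly by evaluating the forecast-variance formula at two covariate profiles, one near and one far from the sample. A minimal NumPy sketch (simulated regressors, \(\sigma^2\) taken as known; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
N = 300
X = np.column_stack([np.ones(N), rng.normal(size=N)])
sigma2 = 1.0

XtX_inv = np.linalg.inv(X.T @ X)

# Forecast variance sigma^2 * (1 + x'(X'X)^{-1} x) at two covariate profiles
x_near = np.array([1.0, 0.0])   # near the sample mean of the regressor
x_far = np.array([1.0, 5.0])    # far from the historical data

v_near = sigma2 * (1 + x_near @ XtX_inv @ x_near)
v_far = sigma2 * (1 + x_far @ XtX_inv @ x_far)
print(v_far > v_near)  # True: extrapolation inflates the forecast variance
```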
Once out-of-sample predictions are generated, how do we evaluate model performance over \(J\) hold-out periods?
Common accuracy measures:
| Measure | Formula | Notes |
|---|---|---|
| RMSE | \(\sqrt{\frac{1}{J}\sum_{j=1}^J (\hat{y}_{N+j} - y_{N+j})^2}\) | Scale-dependent; quadratically penalizes large errors. |
| MAE | \(\frac{1}{J}\sum_{j=1}^J |\hat{y}_{N+j} - y_{N+j}|\) | Scale-dependent; penalizes linearly, more robust to outliers. |
| Theil’s \(U\) | Normalized RMSE ratio | Dimensionless; often compares the model against a naive “random walk” baseline. |
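The scale-dependent measures from the table are straightforward to compute. A minimal NumPy sketch on a tiny illustrative hold-out sample (all numbers hypothetical):

```python
import numpy as np

# Hypothetical hold-out outcomes and their forecasts (J = 4)
y_true = np.array([2.0, 3.5, 1.0, 4.0])
y_pred = np.array([2.5, 3.0, 1.5, 3.0])
err = y_pred - y_true

rmse = np.sqrt(np.mean(err ** 2))  # quadratic penalty on large errors
mae = np.mean(np.abs(err))         # linear penalty, more robust to outliers
print(round(rmse, 4), round(mae, 4))  # 0.6614 0.625
```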
The properties of OLS build sequentially. Each additional assumption buys us a stronger statistical guarantee.
| Assumption | Content | Enables |
|---|---|---|
| A1: Linearity | \(y = X\beta + u\) | OLS can be formulated algebraically |
| A2: Full Rank | \(\operatorname{rank}(X) = k\) | Unique closed-form OLS solution |
| A3: Zero Cond. Mean | \(\mathbb{E}[u \mid X] = 0\) | Finite-sample unbiasedness |
| A4: Spherical Errors | \(\mathbb{E}[uu'\mid X] = \sigma^2 I_N\) | Gauss–Markov (BLUE); \(\operatorname{Var}(\hat\beta\mid X) = \sigma^2(X'X)^{-1}\) |
| A5: Normality | \(u\mid X \sim \mathcal{N}(0,\sigma^2 I_N)\) | Exact \(t\) and \(F\) distributions |
The OLS estimator: \[\hat\beta = (X'X)^{-1}X'y\]
Geometric tools: the projection matrix \(P = X(X'X)^{-1}X'\) and the annihilator \(M = I_N - P\), both symmetric and idempotent, with \(\hat y = Py\) and \(\hat u = My\).
Frisch-Waugh-Lovell (FWL) Theorem: Partialling out control variables \(X_1\) from both \(y\) and \(X_2\) (using the annihilator \(M_1\)) is mathematically equivalent to the full regression. The estimated coefficient on \(X_2\) remains exactly the same.
Finite-sample properties (relying strictly on Assumptions 1–4): unbiasedness, \(\mathbb{E}[\hat\beta\mid X]=\beta\); the exact variance \(\operatorname{Var}(\hat\beta\mid X)=\sigma^2(X'X)^{-1}\); and the Gauss–Markov theorem (OLS is BLUE).
Exact inference (adding Assumption 5, Normality): exact \(t_{N-k}\) distributions for single-coefficient tests and \(F_{q,N-k}\) distributions for joint linear restrictions.