Master's in Economics
Econometric Theory
Common thread: each method keeps part of the OLS logic while relaxing one specific restriction. OLS is the benchmark — these are extensions, not replacements.
Generalized Least Squares (GLS) and Feasible GLS (efficient estimation under nonspherical errors)
Consider the linear model
\[y = X\beta + u, \qquad \mathbb{E}[u\mid X]=0, \qquad \mathbb{V}(u\mid X)=\sigma^2\Omega.\]
Benchmark OLS: assumed \(\Omega = I_n\) (homoskedastic, uncorrelated errors).
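When \(\Omega \neq I_n\), two problems arise: the usual variance formula \(\sigma^2(X'X)^{-1}\) is no longer valid, and OLS is no longer efficient.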
The correct sandwich formula is
\[\mathbb{V}(\hat{\beta}_{OLS}\mid X) = \sigma^2(X'X)^{-1}(X'\Omega X)(X'X)^{-1}.\]
OLS remains unbiased and consistent, but inference is wrong and estimation is inefficient.
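The gap between the usual OLS variance formula and the sandwich formula can be checked numerically. The sketch below is only illustrative: the design, the sample size, and the choice \(\Omega=\operatorname{diag}(x_i^2)\) are assumptions made for the example, not part of the notes.

```python
import numpy as np

def ols_variances(X, Omega, sigma2=1.0):
    """Return (naive, sandwich) conditional variances of the OLS estimator."""
    XtX_inv = np.linalg.inv(X.T @ X)
    naive = sigma2 * XtX_inv                                   # valid only if Omega = I_n
    sandwich = sigma2 * XtX_inv @ (X.T @ Omega @ X) @ XtX_inv  # valid for any Omega
    return naive, sandwich

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])

# Hypothetical heteroskedasticity: error variance proportional to x_i^2.
Omega = np.diag(x ** 2)

naive, sandwich = ols_variances(X, Omega)
naive_id, sandwich_id = ols_variances(X, np.eye(n))  # spherical benchmark
```

With \(\Omega=I_n\) the two matrices coincide; with the heteroskedastic \(\Omega\) they differ, which is why standard errors based on the usual formula are unreliable.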
Suppose \(\Omega\) is known and symmetric positive definite. Then \(\Omega^{-1/2}\) exists and satisfies \(\Omega^{-1/2}\Omega\,\Omega^{-1/2}=I_n\).
Premultiply the model by \(\Omega^{-1/2}\):
\[\underbrace{\Omega^{-1/2}y}_{y^*} = \underbrace{\Omega^{-1/2}X}_{X^*}\beta + \underbrace{\Omega^{-1/2}u}_{u^*}\]
The transformed model \(y^*=X^*\beta+u^*\) satisfies
\[\mathbb{E}[u^*\mid X]=0, \qquad \mathbb{V}(u^*\mid X)=\sigma^2 I_n.\]
Classical OLS assumptions are restored. Applying OLS to the transformed model gives the GLS estimator:
\[\hat{\beta}_{GLS} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y.\]
GLS minimizes a weighted sum of squared residuals:
\[\hat{\beta}_{GLS} = \arg\min_{\beta}\; (y-X\beta)'\Omega^{-1}(y-X\beta).\]
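The equivalence between OLS on the transformed data and the closed-form GLS expression can be verified on simulated data. This is a minimal sketch: the AR(1)-type matrix \(\Omega_{ij}=\rho^{|i-j|}\), the sample size, and the coefficient values are assumptions chosen for illustration, and \(\Omega\) is treated as known.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 50, 0.6
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])   # assumed known, positive definite

X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
u = np.linalg.cholesky(Omega) @ rng.normal(size=n)   # errors with covariance Omega
y = X @ beta_true + u

# Symmetric inverse square root of Omega from its eigendecomposition.
w, V = np.linalg.eigh(Omega)
Omega_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

# Route 1: OLS on the transformed model y* = X* beta + u*.
X_star, y_star = Omega_inv_sqrt @ X, Omega_inv_sqrt @ y
beta_transformed, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

# Route 2: closed-form GLS expression.
Omega_inv = np.linalg.inv(Omega)
beta_closed = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
```

Both routes return the same coefficient vector up to floating-point error. In larger problems one would usually avoid forming \(\Omega^{-1}\) explicitly and work with a factorization instead.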
Proposition — Finite-Sample Variance of GLS
\[\mathbb{V}(\hat{\beta}_{GLS}\mid X) = \sigma^2(X'\Omega^{-1}X)^{-1}.\]
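A short derivation of this variance, using only the model assumptions stated above: substituting \(y=X\beta+u\) into the GLS formula gives
\[\hat{\beta}_{GLS}-\beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}u,\]
so that, conditional on \(X\),
\[\mathbb{V}(\hat{\beta}_{GLS}\mid X) = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\,(\sigma^2\Omega)\,\Omega^{-1}X\,(X'\Omega^{-1}X)^{-1} = \sigma^2(X'\Omega^{-1}X)^{-1}.\]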
Special case — diagonal \(\Omega\): when \(\Omega=\operatorname{diag}(\omega_1,\ldots,\omega_n)\),
\[\hat{\beta}_{GLS} = \left(\sum_{i=1}^n \frac{x_i x_i'}{\omega_i}\right)^{-1} \left(\sum_{i=1}^n \frac{x_i y_i}{\omega_i}\right).\]
This is Weighted Least Squares (WLS): observations with larger error variance receive lower weight.
When \(\Omega=I_n\), GLS collapses to OLS. OLS is a special case of GLS.
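The WLS formula and the collapse to OLS can be illustrated as follows. The data-generating process and the weights \(\omega_i=x_i^2\) are hypothetical choices made only for this sketch.

```python
import numpy as np

def wls(X, y, omega):
    """WLS written as the two weighted sums in the diagonal-Omega formula."""
    Xw = X / omega[:, None]                       # row i is x_i' / omega_i
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)    # (sum x_i x_i'/w_i)^{-1} (sum x_i y_i/w_i)

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(1.0, 4.0, size=n)
X = np.column_stack([np.ones(n), x])
omega = x ** 2                                    # hypothetical error-variance weights
y = X @ np.array([0.5, 1.5]) + np.sqrt(omega) * rng.normal(size=n)

beta_wls = wls(X, y, omega)

# Equivalent route: OLS after dividing each observation by sqrt(omega_i).
s = np.sqrt(omega)
beta_scaled, *_ = np.linalg.lstsq(X / s[:, None], y / s, rcond=None)

# With omega_i = 1 for all i, WLS is just OLS.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_unit = wls(X, y, np.ones(n))
```

Here `beta_wls` and `beta_scaled` agree, and `beta_unit` reproduces `beta_ols`.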
In practice \(\Omega\) is unknown. Suppose \(\Omega=\Omega(\gamma_0)\) for a low-dimensional parameter \(\gamma_0\).
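FGLS procedure: estimate \(\beta\) by OLS, use the OLS residuals to obtain a consistent estimate \(\hat{\gamma}\) of \(\gamma_0\), form \(\hat{\Omega}=\Omega(\hat{\gamma})\), and plug it into the GLS formula: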
\[\hat{\beta}_{FGLS} = (X'\hat{\Omega}^{-1}X)^{-1}X'\hat{\Omega}^{-1}y.\]
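One way to make the feasible estimator concrete is a two-step sketch under an assumed variance model. The parametrization \(\mathbb{V}(u_i\mid x_i)=\exp(\gamma_0+\gamma_1 x_i)\), the auxiliary regression of log squared residuals, and all numerical values below are assumptions for illustration; the notes do not commit to a particular \(\Omega(\gamma)\).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.uniform(0.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([1.0, 2.0])
u = np.exp(0.5 * 1.5 * x) * rng.normal(size=n)    # Var(u_i | x_i) = exp(1.5 * x_i)
y = X @ beta_true + u

def gls_diag(X, y, omega):
    """GLS for a diagonal Omega = diag(omega)."""
    Xw = X / omega[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

# Step 1: OLS and residuals.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols

# Step 2: estimate the variance parameters from log squared residuals.
gamma_hat, *_ = np.linalg.lstsq(X, np.log(resid ** 2), rcond=None)

# Step 3: fitted diagonal of Omega (always positive by construction).
omega_hat = np.exp(X @ gamma_hat)

# Step 4: plug into the GLS formula.
beta_fgls = gls_diag(X, y, omega_hat)
beta_fgls_rescaled = gls_diag(X, y, 3.0 * omega_hat)
```

Rescaling \(\hat{\Omega}\) by a constant leaves the estimate unchanged, so only the relative variances need to be estimated well; this is why the intercept of the auxiliary regression does not matter for \(\hat{\beta}_{FGLS}\).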
Common examples of covariance structures:
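If \(\gamma_0\) is estimated consistently, so that \(\hat{\Omega}=\Omega(\hat{\gamma})\) with \(\hat{\gamma}\xrightarrow{\,p\,}\gamma_0\), then under standard regularity conditions FGLS is asymptotically equivalent to infeasible GLS.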
Derivation sketch: substitute \(y=X\beta_0+u\),
\[\sqrt{n}(\hat{\beta}_{FGLS}-\beta_0) = \left(\frac{X'\hat{\Omega}^{-1}X}{n}\right)^{-1} \left(\frac{X'\hat{\Omega}^{-1}u}{\sqrt{n}}\right).\]
By LLN and CLT:
\[\frac{X'\hat{\Omega}^{-1}X}{n}\xrightarrow{\,p\,} Q_{X\Omega^{-1}X}, \qquad \frac{X'\hat{\Omega}^{-1}u}{\sqrt{n}}\xrightarrow{\,d\,} \mathcal{N}(0,\,\sigma^2 Q_{X\Omega^{-1}X}).\]
Therefore,
\[\sqrt{n}(\hat{\beta}_{FGLS}-\beta_0) \xrightarrow{\,d\,} \mathcal{N}\!\left(0,\;\sigma^2 Q_{X\Omega^{-1}X}^{-1}\right).\]
Remark
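FGLS is asymptotically more efficient than OLS when \(\Omega\) is correctly specified. If \(\Omega\) is misspecified, the efficiency gain is no longer guaranteed and the conventional FGLS standard errors are invalid. In practice, OLS with robust standard errors is often preferred because its inference does not depend on a model for \(\Omega\).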
Nonlinear Least Squares (NLLS) (least-squares principle with a nonlinear regression function)
OLS assumes \(\mathbb{E}[y_i\mid x_i]=x_i'\beta\) — linear in the parameters.
NLLS relaxes linearity in the parameters. Suppose
\[y_i = m(x_i,\beta) + u_i, \qquad \mathbb{E}[u_i\mid x_i]=0,\]
where \(m(x_i,\beta)\) is nonlinear in \(\beta\).
Examples of nonlinear regression functions:
The NLLS estimator minimizes the sum of squared residuals:
\[\hat{\beta}_{NLLS} = \arg\min_{\beta}\; S_n(\beta), \qquad S_n(\beta)=\sum_{i=1}^n \bigl(y_i-m(x_i,\beta)\bigr)^2.\]
Let \(r_i(\beta)=y_i-m(x_i,\beta)\) denote the residual. The FOC is
\[\frac{\partial S_n(\beta)}{\partial \beta} = -2\sum_{i=1}^n \frac{\partial m(x_i,\beta)}{\partial \beta}\bigl(y_i-m(x_i,\beta)\bigr) = 0.\]
Defining the Jacobian matrix
\[G(\beta) = \begin{bmatrix}\partial m(x_1,\beta)/\partial \beta'\\ \vdots\\ \partial m(x_n,\beta)/\partial \beta'\end{bmatrix},\]
the FOC becomes \(G(\beta)'(y-m(X,\beta))=0\), where \(m(X,\beta)\) stacks the \(m(x_i,\beta)\) into an \(n\times 1\) vector. Unlike in OLS, the FOCs are nonlinear in \(\beta\), so in general no closed-form solution exists and numerical methods are required.
Approximate \(m(x_i,\beta)\) around a current guess \(\beta^{(0)}\) by a first-order Taylor expansion:
\[m(x_i,\beta) \approx m(x_i,\beta^{(0)}) + \frac{\partial m(x_i,\beta^{(0)})}{\partial \beta'}(\beta-\beta^{(0)}).\]
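Substituting this linear approximation into the least-squares objective yields a local linear least-squares problem: regress the current residuals \(y-m(X,\beta^{(0)})\) on the Jacobian \(G(\beta^{(0)})\). The resulting Gauss–Newton update is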
\[\beta^{(1)} = \beta^{(0)} + \bigl[G(\beta^{(0)})'G(\beta^{(0)})\bigr]^{-1} G(\beta^{(0)})'\bigl(y-m(X,\beta^{(0)})\bigr).\]
Repeat until convergence.
Note: Each Gauss–Newton iteration solves an OLS problem built from a local linear approximation. The connection to OLS is transparent.
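A minimal sketch of the iteration, for an assumed regression function \(m(x,\beta)=\beta_1\exp(\beta_2 x)\). The functional form, starting values, and simulated data are hypothetical, and plain Gauss–Newton without any step-size safeguard is not guaranteed to converge from poor starting values.

```python
import numpy as np

def m(x, beta):
    """Assumed regression function m(x, beta) = beta_1 * exp(beta_2 * x)."""
    return beta[0] * np.exp(beta[1] * x)

def jacobian(x, beta):
    """n x 2 matrix G(beta) with rows d m(x_i, beta) / d beta'."""
    e = np.exp(beta[1] * x)
    return np.column_stack([e, beta[0] * x * e])

def gauss_newton(x, y, beta0, tol=1e-10, max_iter=100):
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        G = jacobian(x, beta)
        r = y - m(x, beta)
        # OLS of current residuals on G: step = (G'G)^{-1} G'r
        step, *_ = np.linalg.lstsq(G, r, rcond=None)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0.0, 2.0, size=n)
beta_true = np.array([2.0, 0.7])
y = m(x, beta_true) + 0.1 * rng.normal(size=n)

beta_hat = gauss_newton(x, y, beta0=[1.5, 0.5])
foc = jacobian(x, beta_hat).T @ (y - m(x, beta_hat))   # should be close to zero
```

At convergence the step is zero, which is exactly the statement that the first-order condition \(G(\beta)'(y-m(X,\beta))=0\) holds.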
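Consistency. Let \(Q_n(\beta)=\tfrac{1}{n}S_n(\beta)\). If the model is correctly specified and \(\beta_0\) is identified, \(\beta_0\) uniquely minimizes
\[Q(\beta)=\mathbb{E}\bigl[(y_i-m(x_i,\beta))^2\bigr].\]
By a uniform LLN, \(Q_n(\beta)\) converges to \(Q(\beta)\) uniformly over a compact parameter space, so \(\hat{\beta}_{NLLS}\xrightarrow{\,p\,}\beta_0\).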
Asymptotic normality. Define \(g_i(\beta)=\partial m(x_i,\beta)/\partial\beta\). Expanding the FOC around \(\beta_0\),
\[\sqrt{n}(\hat{\beta}_{NLLS}-\beta_0) = A_n^{-1}\,\frac{1}{\sqrt{n}}\sum_{i=1}^n g_i(\beta_0)u_i + o_p(1),\]
where \(A_n = \tfrac{1}{n}\sum_i g_i(\beta_0)g_i(\beta_0)'\). By LLN and CLT:
\[\sqrt{n}(\hat{\beta}_{NLLS}-\beta_0)\xrightarrow{\,d\,} \mathcal{N}(0,\,A^{-1}BA^{-1}),\]
with \(A=\mathbb{E}[g_i g_i']\) and \(B=\mathbb{E}[u_i^2 g_i g_i']\).
Under conditional homoskedasticity (\(B=\sigma^2 A\)), the variance simplifies to \(\sigma^2 A^{-1}\).
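Sample analogues of the two variance formulas can be sketched as below. The function takes the Jacobian evaluated at the estimate and the residuals as inputs; the inputs used here are synthetic placeholders, not output from an actual NLLS fit, and the plug-in estimators \(\hat{A}\), \(\hat{B}\) are the obvious sample averages.

```python
import numpy as np

def nlls_avar(G, resid):
    """Estimate the asymptotic variance of sqrt(n)(beta_hat - beta_0).

    G     : n x k matrix with rows g_i(beta_hat)'
    resid : n-vector of residuals y_i - m(x_i, beta_hat)
    Returns (robust sandwich A^{-1} B A^{-1}, homoskedastic s^2 A^{-1}).
    """
    n = G.shape[0]
    A_hat = G.T @ G / n
    B_hat = (G * resid[:, None] ** 2).T @ G / n
    A_inv = np.linalg.inv(A_hat)
    robust = A_inv @ B_hat @ A_inv
    homosk = np.mean(resid ** 2) * A_inv
    return robust, homosk

rng = np.random.default_rng(9)
G = rng.normal(size=(500, 2))                            # placeholder Jacobian

# Residuals with constant squared size: B_hat = s^2 * A_hat exactly.
resid_const = 0.3 * rng.choice([-1.0, 1.0], size=500)
robust_c, homosk_c = nlls_avar(G, resid_const)

# Residuals whose variance depends on the first column of G.
resid_het = np.abs(G[:, 0]) * rng.normal(size=500)
robust_h, homosk_h = nlls_avar(G, resid_het)
```

Dividing either matrix by \(n\) gives an approximate variance for \(\hat{\beta}_{NLLS}\) itself. The two estimators agree in the first case and differ in the second, mirroring the role of the condition \(B=\sigma^2A\).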
A simple feedforward neural network (one hidden layer) can be written as
\[m(x_i,\theta) = \alpha_0 + \alpha'x_i + \sum_{j=1}^q \gamma_j\, G(x_i'\delta_j),\]
where \(G(\cdot)\) is an activation function (e.g., logistic), not to be confused with the Jacobian matrix \(G(\beta)\) used earlier. Estimating this with squared-error loss is an NLLS problem.
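The regression function and its squared-error objective can be written in a few lines. This is only a sketch of the formula above with a logistic activation; the dimensions and parameter values are arbitrary, and no training routine is included.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def network(X, alpha0, alpha, gamma, delta):
    """alpha0 + X alpha + sum_j gamma_j * logistic(X delta_j).

    X: n x p, alpha: p, gamma: q, delta: p x q (column j is delta_j).
    """
    return alpha0 + X @ alpha + logistic(X @ delta) @ gamma

def ssr(X, y, alpha0, alpha, gamma, delta):
    """NLLS objective S_n(theta) for the one-hidden-layer network."""
    r = y - network(X, alpha0, alpha, gamma, delta)
    return float(r @ r)

rng = np.random.default_rng(7)
n, p, q = 50, 3, 4
X = rng.normal(size=(n, p))
alpha0, alpha = 0.5, rng.normal(size=p)
gamma, delta = rng.normal(size=q), rng.normal(size=(p, q))

y = network(X, alpha0, alpha, gamma, delta) + 0.1 * rng.normal(size=n)
loss_at_truth = ssr(X, y, alpha0, alpha, gamma, delta)

# With all gamma_j = 0 the hidden layer drops out and the model is linear.
linear_case = network(X, alpha0, alpha, np.zeros(q), delta)
```

Setting every \(\gamma_j=0\) recovers the linear predictor \(\alpha_0+\alpha'x_i\), so the linear model is nested in the network.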
What changes in modern machine learning:
From an econometric perspective, neural networks are flexible nonlinear approximators. They extend the NLLS logic rather than replacing it. Machine learning adds scale, computation, regularization, and a focus on prediction.
Ridge and LASSO (penalized least squares for stability and variable selection)
OLS behaves poorly when:
The key idea: keep the linear predictor \(y_i \approx x_i'\beta\), but add a penalty that shrinks coefficients toward zero.
This introduces bias but can substantially reduce variance, which is the classic bias–variance tradeoff.
Two approaches depending on the penalty function:
| Method | Penalty | Variable Selection? |
|---|---|---|
| Ridge | \(\sum_j \beta_j^2\) (\(L_2\)) | No — smooth shrinkage |
| LASSO | \(\sum_j |\beta_j|\) (\(L_1\)) | Yes — exact zeros |
Ridge solves
\[\hat{\beta}_{Ridge} = \arg\min_{\beta} \left\{\frac{1}{n}\sum_{i=1}^n (y_i-x_i'\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2\right\}.\]
The quadratic penalty yields a closed-form solution:
\[\hat{\beta}_{Ridge} = (X'X + n\lambda I_p)^{-1}X'y.\]
Orthogonal design (\(X'X/n=I_p\)): each coefficient shrinks by the same factor,
\[\hat{\beta}_{j,Ridge} = \frac{1}{1+\lambda}\,\hat{\beta}_{j,OLS}.\]
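Both the closed form and the orthogonal-design shrinkage factor can be checked numerically. In this sketch the orthogonal design is built artificially by rescaling orthonormal columns; the data and the value of \(\lambda\) are arbitrary.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form Ridge for the objective (1/n) * SSR + lam * sum(beta_j^2)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(5)
n, p, lam = 120, 4, 0.5

# Orthogonal design: rescale orthonormal columns so that X'X / n = I_p.
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = np.sqrt(n) * Q
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = ridge(X, y, lam)

# Gradient of the Ridge objective at the solution (numerically zero).
grad = -2.0 / n * X.T @ (y - X @ beta_ridge) + 2.0 * lam * beta_ridge
```

In this design `beta_ridge` equals `beta_ols / (1 + lam)` component by component, so every coefficient is shrunk but none is set to zero.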
LASSO solves
\[\hat{\beta}_{LASSO} = \arg\min_{\beta} \left\{\frac{1}{2n}\sum_{i=1}^n (y_i-x_i'\beta)^2 + \lambda\sum_{j=1}^p |\beta_j|\right\}.\]
The \(L_1\) penalty has a kink at zero — it can set coefficients exactly to zero: LASSO performs variable selection.
Orthogonal design — soft-thresholding solution:
\[\hat{\beta}_{j,LASSO} = \operatorname{sgn}(\hat{\beta}_{j,OLS})\max\!\left\{|\hat{\beta}_{j,OLS}|-\lambda,\;0\right\}.\]
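The soft-thresholding rule is easy to state in code. The sketch below applies it in an artificially constructed orthogonal design and compares the LASSO objective at the thresholded coefficients with the objective at randomly perturbed coefficient vectors; data and \(\lambda\) are arbitrary.

```python
import numpy as np

def soft_threshold(b, lam):
    """sgn(b) * max(|b| - lam, 0), applied elementwise."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def lasso_objective(X, y, beta, lam):
    """(1 / 2n) * SSR + lam * sum |beta_j|."""
    n = X.shape[0]
    r = y - X @ beta
    return r @ r / (2 * n) + lam * np.sum(np.abs(beta))

rng = np.random.default_rng(6)
n, p, lam = 150, 4, 0.3
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = np.sqrt(n) * Q                                  # X'X / n = I_p
y = X @ np.array([1.5, -0.2, 0.0, 0.8]) + rng.normal(size=n)

beta_ols = X.T @ y / n                              # OLS under X'X / n = I_p
beta_lasso = soft_threshold(beta_ols, lam)

obj_star = lasso_objective(X, y, beta_lasso, lam)
obj_perturbed = [
    lasso_objective(X, y, beta_lasso + 0.05 * rng.normal(size=p), lam)
    for _ in range(200)
]
```

Any OLS coefficient smaller than \(\lambda\) in absolute value is mapped to exactly zero, and none of the perturbed vectors achieves a lower objective than the thresholded solution.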
When regressors are highly correlated, LASSO tends to keep one of them, somewhat arbitrarily, and drop the others. Ridge shrinks the coefficients of correlated variables toward each other but does not select.
The Elastic Net combines both penalties:
\[\hat{\beta}_{ElasticNet} = \arg\min_{\beta}\left\{\frac{1}{2n}\sum_{i=1}^n (y_i-x_i'\beta)^2 + \lambda\sum_{j=1}^p\!\left(\alpha|\beta_j| + \frac{1-\alpha}{2}\beta_j^2\right)\right\},\]
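where \(\alpha\in[0,1]\) controls the mix: \(\alpha=1\) gives LASSO, \(\alpha=0\) gives a Ridge-type (\(L_2\)) penalty, and intermediate values blend the two.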
Elastic Net produces sparse models like LASSO while maintaining the grouping effect of Ridge for correlated variables.
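The Elastic Net objective above can be minimized by cyclic coordinate descent, a standard algorithm for this problem that the notes themselves do not discuss. The sketch assumes no intercept (centered data) and uses arbitrary simulated data; the function name `elastic_net_cd` is hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha, tol=1e-12, max_iter=10_000):
    """Coordinate descent for (1/2n)*SSR + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = np.sum(X ** 2, axis=0) / n       # (1/n) x_j'x_j
    resid = y.copy()                          # y - X beta, with beta = 0 initially
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            # (1/n) x_j' (partial residual that excludes regressor j)
            rho = X[:, j] @ resid / n + col_ss[j] * old
            new = soft_threshold(rho, lam * alpha) / (col_ss[j] + lam * (1.0 - alpha))
            if new != old:
                resid -= X[:, j] * (new - old)
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta

rng = np.random.default_rng(10)
n, p, lam = 200, 5, 0.2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -1.5, 0.0, 0.3]) + rng.normal(size=n)

# alpha = 0: matches the Ridge closed form (X'X + n*lam*I)^{-1} X'y.
beta_en0 = elastic_net_cd(X, y, lam, alpha=0.0)
beta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# alpha = 1 in an orthogonal design: matches soft-thresholding of OLS.
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X_orth = np.sqrt(n) * Q
beta_en1 = elastic_net_cd(X_orth, y, lam, alpha=1.0)
beta_soft = soft_threshold(X_orth.T @ y / n, lam)
```

The two boundary cases reproduce the Ridge and LASSO solutions, which is a useful sanity check on the parametrization of \(\lambda\) and \(\alpha\).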
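The tuning parameter \(\lambda\) governs the bias–variance tradeoff: as \(\lambda\to 0\) the estimator approaches OLS (low bias, high variance), while a larger \(\lambda\) shrinks coefficients more strongly toward zero (more bias, less variance).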
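Choosing \(\lambda\): not by hypothesis testing, but typically by \(K\)-fold cross-validation, which selects the \(\lambda\) with the smallest estimated out-of-sample prediction error.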
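A bare-bones version of \(K\)-fold cross-validation for Ridge might look as follows. The grid of \(\lambda\) values, the number of folds, and the simulated data are all assumptions for the example; in applied work one would normally rely on a tested library routine.

```python
import numpy as np

def ridge_fit(X, y, lam):
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lambdas, K=5, seed=0):
    """Average out-of-fold mean squared prediction error for each lambda."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    mse = np.zeros(len(lambdas))
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False                 # hold out the current fold
        for k, lam in enumerate(lambdas):
            b = ridge_fit(X[train], y[train], lam)
            mse[k] += np.mean((y[test_idx] - X[test_idx] @ b) ** 2) / K
    return mse

rng = np.random.default_rng(11)
n, p = 100, 30                                  # many regressors relative to n
X = rng.normal(size=(n, p))
beta = np.concatenate([np.array([2.0, -1.0, 0.5]), np.zeros(p - 3)])
y = X @ beta + rng.normal(size=n)

lambdas = np.array([0.001, 0.01, 0.1, 1.0, 10.0])
mse = cv_mse(X, y, lambdas)
lam_best = lambdas[np.argmin(mse)]
```

The selected `lam_best` is the grid point with the lowest average held-out error; a finer grid around it would be the natural next step.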
Asymptotic comments (fixed-dimensional case):
Ridge and LASSO are still linear models: the fitted predictor remains linear in the covariates. What changes is the estimation criterion. They extend OLS; they do not replace it.
Quantile Regression (modeling conditional quantiles beyond the conditional mean)
OLS targets \(\mathbb{E}[y_i\mid x_i]\) — the conditional mean.
But many questions concern other parts of the distribution:
Quantile regression models conditional quantiles:
\[Q_{\tau}(y_i\mid x_i) = x_i'\beta(\tau), \qquad \tau\in(0,1).\]
Instead of one coefficient vector, we estimate a family \(\beta(\tau)\) indexed by the quantile level.
\(\beta_j(\tau)\) measures the marginal effect of \(x_{ij}\) on the \(\tau\)-th conditional quantile of \(y_i\). If \(\beta_j(\tau)\) varies with \(\tau\), effects are heterogeneous across the outcome distribution.
Median (\(\tau=0.5\)): instead of minimizing squared residuals (OLS), minimize absolute residuals — Least Absolute Deviations (LAD):
\[\hat{\beta}_{LAD} = \arg\min_{\beta}\sum_{i=1}^n |y_i - x_i'\beta|.\]
LAD is more robust to outliers because it penalizes errors linearly, not quadratically.
General \(\tau\): replace the absolute value with the asymmetric check-loss function:
\[\rho_{\tau}(u) = u\bigl(\tau - \mathbf{1}\{u<0\}\bigr) = \begin{cases}\tau u, & u\geq 0,\\ (\tau-1)u, & u<0.\end{cases}\]
The quantile-regression estimator solves:
\[\hat{\beta}(\tau) = \arg\min_{\beta}\sum_{i=1}^n \rho_{\tau}(y_i-x_i'\beta).\]
Why does asymmetric loss target quantiles?
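Consider \(\tau=0.90\): a positive residual (an observation above the fitted line) is penalized with weight 0.90, while a negative residual (an observation below the line) is penalized with weight 0.10.
This asymmetry pushes the fitted line upward until roughly 90% of observations lie below it and 10% lie above, which targets the 90th conditional quantile.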
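The link between the check loss and quantiles is easiest to see in an intercept-only model, where the fitted "line" is a single constant \(c\). The sketch below searches over the observed values as candidates for \(c\) (the piecewise-linear objective attains its minimum at a data point); the simulated sample is arbitrary.

```python
import numpy as np

def check_loss(u, tau):
    """rho_tau(u) = u * (tau - 1{u < 0}), applied elementwise."""
    return u * (tau - (u < 0))

rng = np.random.default_rng(8)
tau = 0.90
y = rng.normal(size=1001)

# U[i, j] = y_i - c_j, where the candidate constants c_j are the data points.
U = y[:, None] - y[None, :]
total_loss = check_loss(U, tau).sum(axis=0)     # objective for each candidate c_j

c_star = y[np.argmin(total_loss)]               # minimizer of the check-loss sum
share_below = np.mean(y <= c_star)              # fraction of observations at or below it
```

The fraction `share_below` comes out very close to \(\tau\), so the minimizer is the sample 0.90 quantile. With \(\tau=0.5\) the loss is half the absolute value and the same search returns the sample median.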
First-order conditions (in subgradient form, since \(\rho_\tau\) is not everywhere differentiable):
\[\sum_{i=1}^n x_i\bigl(\tau - \mathbf{1}\{y_i - x_i'\beta < 0\}\bigr) = 0.\]
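Because the indicator is discontinuous, this condition holds only approximately in a finite sample, up to a term that vanishes asymptotically. At the optimum, positive residuals weighted by \(\tau\) balance negative residuals weighted by \(1-\tau\) along each regressor.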
Let \(Q_{\tau}(y_i\mid x_i)=x_i'\beta_0(\tau)\). Under standard regularity conditions:
\[\sqrt{n}\bigl(\hat{\beta}(\tau)-\beta_0(\tau)\bigr)\xrightarrow{\,d\,} \mathcal{N}(0,\,V_{\tau}),\]
where
\[V_{\tau} = \tau(1-\tau)\,D_{\tau}^{-1}\,Q\,D_{\tau}^{-1},\]
with
\[Q = \mathbb{E}[x_i x_i'], \qquad D_{\tau} = \mathbb{E}\bigl[f_{u_{\tau}\mid x}(0\mid x_i)\,x_i x_i'\bigr].\]
The term \(f_{u_\tau\mid x}(0\mid x_i)\) is the conditional density of the quantile error \(u_{\tau,i}=y_i-x_i'\beta_0(\tau)\) evaluated at zero.
Note: Unlike OLS, the asymptotic variance of QR depends on the conditional density at the quantile of interest — not just second moments of the regressors. Standard OLS variance formulas do not apply.
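As an illustration under an extra assumption not made in the text, suppose the quantile error is independent of \(x_i\), so that \(f_{u_\tau\mid x}(0\mid x_i)=f_{u_\tau}(0)\) for all \(x_i\). Then \(D_\tau=f_{u_\tau}(0)\,Q\), and the asymptotic variance reduces to
\[V_\tau = \tau(1-\tau)\,\bigl(f_{u_\tau}(0)\,Q\bigr)^{-1}Q\,\bigl(f_{u_\tau}(0)\,Q\bigr)^{-1} = \frac{\tau(1-\tau)}{f_{u_\tau}(0)^2}\,Q^{-1}.\]
This has the same shape as the homoskedastic OLS variance \(\sigma^2Q^{-1}\), with \(\sigma^2\) replaced by a factor that is large when the error density at the quantile of interest is low, that is, when few observations lie near that quantile.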
| Method | Relaxes | Keeps | Key Feature |
|---|---|---|---|
| GLS/FGLS | Homoskedasticity + uncorrelated errors | Linear conditional mean | Efficient under nonspherical errors |
| NLLS | Linearity in parameters | Least-squares criterion | Nonlinear regression functions |
| Ridge | OLS objective | Linear predictor | Smooth shrinkage, no selection |
| LASSO | OLS objective | Linear predictor | Shrinkage + variable selection |
| Quantile Reg. | Focus on conditional mean | Linear index | Effects at any quantile \(\tau\) |
The common lesson: the linear model is not abandoned — it is adapted. Each extension keeps the core intuition of OLS while responding to a specific empirical challenge.