Extensions of the Linear Model

Master's in Economics (Magíster en Economía)
Teoría Econométrica (Econometric Theory)

Prof. Luis Chancí

www.luischanci.com

Outline

  1. GLS & Feasible GLS
    efficient estimation when \(\operatorname{Var}(u\mid X)=\sigma^2\Omega\neq\sigma^2 I_n\)
  2. Nonlinear Least Squares
    least-squares principle applied to nonlinear regression functions
  3. Ridge & LASSO
    penalized estimation for collinearity, overfitting, and high dimensions
  4. Quantile Regression
    modeling conditional quantiles instead of the conditional mean

Common thread: each method keeps part of the OLS logic while relaxing one specific restriction. OLS is the benchmark — these are extensions, not replacements.

1. GLS and Feasible GLS

(efficient estimation under nonspherical errors)

Why OLS is No Longer Efficient

Consider the linear model

\[y = X\beta + u, \qquad \mathbb{E}[u\mid X]=0, \qquad \mathbb{V}(u\mid X)=\sigma^2\Omega.\]

Benchmark OLS: assumed \(\Omega = I_n\) (homoskedastic, uncorrelated errors).

When \(\Omega \neq I_n\), two problems arise:

  • OLS wastes information — it ignores the known covariance structure.
  • The classical variance formula \(\mathbb{V}(\hat{\beta}_{OLS}\mid X)=\sigma^2(X'X)^{-1}\) is incorrect.

The correct sandwich formula is

\[\mathbb{V}(\hat{\beta}_{OLS}\mid X) = \sigma^2(X'X)^{-1}(X'\Omega X)(X'X)^{-1}.\]

OLS remains unbiased and consistent, but inference based on the classical variance formula is invalid and the estimator is no longer efficient.

The GLS Transformation

Suppose \(\Omega\) is known and symmetric positive definite. Then \(\Omega^{-1/2}\) exists and satisfies \(\Omega^{-1/2}\Omega\,\Omega^{-1/2}=I_n\).

Premultiply the model by \(\Omega^{-1/2}\):

\[\underbrace{\Omega^{-1/2}y}_{y^*} = \underbrace{\Omega^{-1/2}X}_{X^*}\beta + \underbrace{\Omega^{-1/2}u}_{u^*}\]

The transformed model \(y^*=X^*\beta+u^*\) satisfies

\[\mathbb{E}[u^*\mid X]=0, \qquad \mathbb{V}(u^*\mid X)=\sigma^2 I_n.\]

Classical OLS assumptions are restored. Applying OLS to the transformed model gives the GLS estimator:

\[\hat{\beta}_{GLS} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y.\]
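The equivalence between OLS on the transformed data and the closed-form GLS expression can be checked numerically. The sketch below is illustrative only: the simulated design, the AR(1)-type \(\Omega\) with \(\rho=0.6\), and all variable names are assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3

# Simulated design with an intercept (illustrative data, not from the lecture)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])

# Assumed known covariance structure: Omega[i, j] = rho^{|i - j|}
rho = 0.6
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

# Draw errors with Var(u | X) = Omega using the Cholesky factor
L = np.linalg.cholesky(Omega)
u = L @ rng.normal(size=n)
y = X @ beta + u

# Omega^{-1/2} from the eigendecomposition of the symmetric p.d. matrix
eigval, eigvec = np.linalg.eigh(Omega)
Omega_inv_half = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

# Route 1: OLS on the transformed data (y*, X*)
X_star = Omega_inv_half @ X
y_star = Omega_inv_half @ y
beta_transformed, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

# Route 2: closed-form GLS, (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
Omega_inv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

print(beta_transformed)
print(beta_gls)
```

Both routes should agree up to floating-point error. In larger samples one would avoid forming \(\Omega^{-1}\) explicitly and exploit the structure of \(\Omega\) instead.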

GLS as Weighted Least Squares

GLS minimizes a weighted sum of squared residuals:

\[\hat{\beta}_{GLS} = \arg\min_{\beta}\; (y-X\beta)'\Omega^{-1}(y-X\beta).\]

Proposition — Finite-Sample Variance of GLS

\[\mathbb{V}(\hat{\beta}_{GLS}\mid X) = \sigma^2(X'\Omega^{-1}X)^{-1}.\]

Special case — diagonal \(\Omega\): when \(\Omega=\operatorname{diag}(\omega_1,\ldots,\omega_n)\),

\[\hat{\beta}_{GLS} = \left(\sum_{i=1}^n \frac{x_i x_i'}{\omega_i}\right)^{-1} \left(\sum_{i=1}^n \frac{x_i y_i}{\omega_i}\right).\]

This is Weighted Least Squares (WLS): observations with larger error variance receive lower weight.

When \(\Omega=I_n\), GLS collapses to OLS. OLS is a special case of GLS.
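A small numerical sketch of the diagonal case may help. It computes WLS from the weighted sums, confirms that it equals OLS on data rescaled by \(1/\sqrt{\omega_i}\), and compares the GLS variance with the OLS sandwich variance. The skedastic function \(\omega_i=x_i^2\) and the simulated data are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])

# Assumed for illustration: omega_i = x_i^2 and sigma^2 = 1
omega = x ** 2
sigma2 = 1.0
y = X @ np.array([1.0, 0.5]) + np.sqrt(sigma2 * omega) * rng.normal(size=n)

# WLS written as the weighted sums in the diagonal-Omega formula
w = 1.0 / omega
XtWX = (X * w[:, None]).T @ X   # sum_i x_i x_i' / omega_i
XtWy = (X * w[:, None]).T @ y   # sum_i x_i y_i / omega_i
beta_wls = np.linalg.solve(XtWX, XtWy)

# Same estimator from OLS on data divided by sqrt(omega_i)
s = np.sqrt(omega)
beta_rescaled, *_ = np.linalg.lstsq(X / s[:, None], y / s, rcond=None)

# Conditional variances: OLS sandwich vs. GLS
XtX_inv = np.linalg.inv(X.T @ X)
V_ols = sigma2 * XtX_inv @ (X.T @ (X * omega[:, None])) @ XtX_inv
V_gls = sigma2 * np.linalg.inv(XtWX)

# V_ols - V_gls should be positive semidefinite (GLS is at least as efficient)
gap_eigs = np.linalg.eigvalsh(V_ols - V_gls)

print(beta_wls, beta_rescaled)
print(gap_eigs)
```

The eigenvalues of the variance gap should be nonnegative up to rounding, which is the efficiency claim in matrix form.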

Feasible GLS

In practice \(\Omega\) is unknown. Suppose \(\Omega=\Omega(\gamma_0)\) for a low-dimensional parameter \(\gamma_0\).

FGLS procedure:

  1. Estimate \(\gamma_0\) consistently: obtain \(\hat{\gamma}\).
  2. Form \(\hat{\Omega}=\Omega(\hat{\gamma})\).
  3. Compute the FGLS estimator:

\[\hat{\beta}_{FGLS} = (X'\hat{\Omega}^{-1}X)^{-1}X'\hat{\Omega}^{-1}y.\]

Common examples of covariance structures:

  • Heteroskedasticity: \(\Omega\) diagonal, diagonal entries vary with observable characteristics.
  • Serial correlation: \(\Omega\) reflects AR(1) or other structured time-series dependence.
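The three FGLS steps can be sketched for the heteroskedastic case. The parametric form \(\operatorname{Var}(u_i\mid x_i)=\exp(\gamma_0+\gamma_1\log x_i)\), the log-squared-residual regression used to estimate \(\gamma\), and the simulated data are all assumptions made for this example; other specifications of \(\Omega(\gamma)\) would need a different first step.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
beta0 = np.array([1.0, 0.5])

# Assumed skedastic model: Var(u_i | x_i) = exp(gamma_0 + gamma_1 * log x_i)
gamma_true = np.array([0.0, 2.0])
Z = np.column_stack([np.ones(n), np.log(x)])
y = X @ beta0 + np.exp(0.5 * (Z @ gamma_true)) * rng.normal(size=n)

# Step 1: OLS residuals, then regress log(u_hat^2) on Z to estimate gamma.
# The slope is estimated consistently; the intercept is shifted by a constant,
# which only rescales Omega and therefore does not affect the FGLS coefficients.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
u_hat = y - X @ beta_ols
gamma_hat, *_ = np.linalg.lstsq(Z, np.log(u_hat ** 2), rcond=None)

# Step 2: form the diagonal of Omega_hat
omega_hat = np.exp(Z @ gamma_hat)

# Step 3: FGLS = WLS with weights 1 / omega_hat
w = 1.0 / omega_hat
beta_fgls = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ y)

print(gamma_hat)
print(beta_ols, beta_fgls)
```

Only the relative sizes of the \(\hat{\omega}_i\) matter for \(\hat{\beta}_{FGLS}\), since a common scale factor cancels in the formula.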

Asymptotic Theory of FGLS

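If \(\hat{\gamma}\xrightarrow{\,p\,}\gamma_0\), so that \(\hat{\Omega}=\Omega(\hat{\gamma})\) is consistent for \(\Omega\), and standard regularity conditions hold, FGLS behaves asymptotically like infeasible GLS.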

Derivation sketch: substitute \(y=X\beta_0+u\),

\[\sqrt{n}(\hat{\beta}_{FGLS}-\beta_0) = \left(\frac{X'\hat{\Omega}^{-1}X}{n}\right)^{-1} \left(\frac{X'\hat{\Omega}^{-1}u}{\sqrt{n}}\right).\]

By LLN and CLT:

\[\frac{X'\hat{\Omega}^{-1}X}{n}\xrightarrow{\,p\,} Q_{X\Omega^{-1}X}, \qquad \frac{X'\hat{\Omega}^{-1}u}{\sqrt{n}}\xrightarrow{\,d\,} \mathcal{N}(0,\,\sigma^2 Q_{X\Omega^{-1}X}).\]

Therefore,

\[\sqrt{n}(\hat{\beta}_{FGLS}-\beta_0) \xrightarrow{\,d\,} \mathcal{N}\!\left(0,\;\sigma^2 Q_{X\Omega^{-1}X}^{-1}\right).\]

Remark

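If \(\Omega(\gamma)\) is correctly specified, FGLS is asymptotically more efficient than OLS, but it is more sensitive to misspecification of \(\Omega\). In practice, OLS with robust standard errors is often preferred because its inference remains valid without a model for \(\Omega\).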

2. Nonlinear Least Squares

(least-squares principle with a nonlinear regression function)

From Linear to Nonlinear Regression

OLS assumes \(\mathbb{E}[y_i\mid x_i]=x_i'\beta\) — linear in the parameters.

NLLS relaxes linearity in the parameters. Suppose

\[y_i = m(x_i,\beta) + u_i, \qquad \mathbb{E}[u_i\mid x_i]=0,\]

where \(m(x_i,\beta)\) is nonlinear in \(\beta\).

Examples of nonlinear regression functions:

  • Power function: \(m(x_i,\beta)=\beta_1+\beta_2\, x_i^{\beta_3}\)
  • Exponential: \(m(x_i,\beta)=\beta_1+\beta_2\, e^{\beta_3 x_i}\)
  • Threshold-type models with parameters inside indicator functions

Nonlinear Regression

The NLLS estimator minimizes the sum of squared residuals:

\[\hat{\beta}_{NLLS} = \arg\min_{\beta}\; S_n(\beta), \qquad S_n(\beta)=\sum_{i=1}^n \bigl(y_i-m(x_i,\beta)\bigr)^2.\]

Let \(r_i(\beta)=y_i-m(x_i,\beta)\) denote the residual. The FOC is

\[\frac{\partial S_n(\beta)}{\partial \beta} = -2\sum_{i=1}^n \frac{\partial m(x_i,\beta)}{\partial \beta}\bigl(y_i-m(x_i,\beta)\bigr) = 0.\]

Defining the Jacobian matrix

\[G(\beta) = \begin{bmatrix}\partial m(x_1,\beta)/\partial \beta'\\ \vdots\\ \partial m(x_n,\beta)/\partial \beta'\end{bmatrix},\]

the FOC becomes \(G(\beta)'(y-m(X,\beta))=0\), where \(m(X,\beta)\) stacks the \(m(x_i,\beta)\) into an \(n\times 1\) vector. Unlike in OLS, the FOCs are nonlinear in \(\beta\), so in general no closed-form solution exists and numerical methods are required.

The Gauss–Newton Algorithm

Approximate \(m(x_i,\beta)\) around a current guess \(\beta^{(0)}\) by a first-order Taylor expansion:

\[m(x_i,\beta) \approx m(x_i,\beta^{(0)}) + \frac{\partial m(x_i,\beta^{(0)})}{\partial \beta'}(\beta-\beta^{(0)}).\]

Substituting this linear approximation into the least-squares objective yields a local linear problem. The update is

\[\beta^{(1)} = \beta^{(0)} + \bigl[G(\beta^{(0)})'G(\beta^{(0)})\bigr]^{-1} G(\beta^{(0)})'\bigl(y-m(X,\beta^{(0)})\bigr).\]

Repeat until convergence.

Note: Each Gauss–Newton iteration is an OLS regression of the current residuals \(y-m(X,\beta^{(0)})\) on the Jacobian \(G(\beta^{(0)})\); the fitted coefficients are the update step. This makes the connection to OLS transparent.
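The Gauss–Newton update can be sketched for the exponential example \(m(x_i,\beta)=\beta_1+\beta_2 e^{\beta_3 x_i}\). The simulated data, starting values, and tolerances are assumptions, and the step-halving safeguard is an addition for numerical stability rather than part of the basic update.

```python
import numpy as np

def m(x, beta):
    """Exponential regression function m(x, beta) = b1 + b2 * exp(b3 * x)."""
    return beta[0] + beta[1] * np.exp(beta[2] * x)

def jacobian(x, beta):
    """n x 3 matrix G(beta) whose i-th row is dm(x_i, beta)/dbeta'."""
    e = np.exp(beta[2] * x)
    return np.column_stack([np.ones_like(x), e, beta[1] * x * e])

def gauss_newton(x, y, beta_start, tol=1e-10, max_iter=200):
    beta = np.asarray(beta_start, dtype=float)
    for _ in range(max_iter):
        r = y - m(x, beta)
        G = jacobian(x, beta)
        # Gauss-Newton step: OLS coefficients from regressing r on G
        step, *_ = np.linalg.lstsq(G, r, rcond=None)
        # Step halving (added safeguard): shrink the step until SSR does not increase
        t = 1.0
        while t > 1e-8:
            r_new = y - m(x, beta + t * step)
            if r_new @ r_new <= r @ r:
                break
            t *= 0.5
        beta = beta + t * step
        if np.max(np.abs(t * step)) < tol:
            break
    return beta

# Illustrative simulated data (assumed values)
rng = np.random.default_rng(6)
x = rng.uniform(0.0, 2.0, size=300)
beta_true = np.array([1.0, 2.0, 1.0])
y = m(x, beta_true) + 0.2 * rng.normal(size=300)

beta_hat = gauss_newton(x, y, beta_start=[0.5, 1.5, 1.2])
foc = jacobian(x, beta_hat).T @ (y - m(x, beta_hat))  # G(beta)'(y - m(X, beta))

print(beta_hat)
print(foc)
```

At convergence the first-order condition \(G(\beta)'(y-m(X,\beta))=0\) should hold up to numerical error. Gauss–Newton can be sensitive to starting values, so in applied work several starting points are often tried.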

Asymptotic Theory of NLLS

Let \(Q_n(\beta)=\tfrac{1}{n}S_n(\beta)\). If the model is correctly specified, \(\beta_0\) uniquely minimizes \[Q(\beta)=\mathbb{E}\bigl[(y_i-m(x_i,\beta))^2\bigr].\] By a uniform LLN: \(\hat{\beta}_{NLLS}\xrightarrow{\,p\,}\beta_0\).

Asymptotic normality. Define \(g_i(\beta)=\partial m(x_i,\beta)/\partial\beta\). Expanding the FOC around \(\beta_0\),

\[\sqrt{n}(\hat{\beta}_{NLLS}-\beta_0) = A_n^{-1}\,\frac{1}{\sqrt{n}}\sum_{i=1}^n g_i(\beta_0)u_i + o_p(1),\]

where \(A_n = \tfrac{1}{n}\sum_i g_i(\beta_0)g_i(\beta_0)'\). By LLN and CLT:

\[\sqrt{n}(\hat{\beta}_{NLLS}-\beta_0)\xrightarrow{\,d\,} \mathcal{N}(0,\,A^{-1}BA^{-1}),\]

with \(A=\mathbb{E}[g_i g_i']\) and \(B=\mathbb{E}[u_i^2 g_i g_i']\).

Under conditional homoskedasticity (\(B=\sigma^2 A\)), the variance simplifies to \(\sigma^2 A^{-1}\).
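For inference, the sandwich above is typically estimated by sample analogs evaluated at \(\hat{\beta}_{NLLS}\). The hatted symbols below are notation introduced here for illustration and do not appear in the original slides.

```latex
% Plug-in quantities: estimated gradient and residual
\[\hat{g}_i = g_i(\hat{\beta}_{NLLS}), \qquad \hat{u}_i = y_i - m(x_i,\hat{\beta}_{NLLS})\]
% Sample analogs of A and B
\[\hat{A} = \frac{1}{n}\sum_{i=1}^n \hat{g}_i\hat{g}_i' = \frac{1}{n}G(\hat{\beta}_{NLLS})'G(\hat{\beta}_{NLLS}),
  \qquad \hat{B} = \frac{1}{n}\sum_{i=1}^n \hat{u}_i^2\,\hat{g}_i\hat{g}_i'\]
% Heteroskedasticity-robust variance estimator and standard errors
\[\hat{V} = \hat{A}^{-1}\hat{B}\hat{A}^{-1}, \qquad
  \operatorname{se}(\hat{\beta}_j) = \sqrt{\hat{V}_{jj}/n}\]
```

This mirrors the heteroskedasticity-robust variance estimator for OLS, with the Jacobian \(G(\hat{\beta}_{NLLS})\) playing the role of \(X\).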

NLLS, Neural Networks, and Machine Learning

A simple feedforward neural network (one hidden layer) can be written as

\[m(x_i,\theta) = \alpha_0 + \alpha'x_i + \sum_{j=1}^q \gamma_j\, G(x_i'\delta_j),\]

where \(G(\cdot)\) is an activation function (e.g., logistic). Estimating this with squared-error loss is an NLLS problem.
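To make the notation concrete, the regression function above can be evaluated in a few lines. This is a minimal sketch assuming a logistic activation; the function names, the matrix `Delta` (whose columns are the \(\delta_j\)), and the simulated values are hypothetical.

```python
import numpy as np

def logistic(z):
    """Logistic activation G(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def nn_mean(X, alpha0, alpha, gamma, Delta):
    """m(x_i, theta) = alpha0 + alpha'x_i + sum_j gamma_j * G(x_i'delta_j) for each row of X.

    Delta is k x q, with column j equal to delta_j.
    """
    return alpha0 + X @ alpha + logistic(X @ Delta) @ gamma

def squared_error_loss(X, y, alpha0, alpha, gamma, Delta):
    """NLLS objective S_n(theta) for the network."""
    r = y - nn_mean(X, alpha0, alpha, gamma, Delta)
    return r @ r

# Illustrative values (assumed): k = 2 inputs, q = 3 hidden units
rng = np.random.default_rng(7)
n, k, q = 50, 2, 3
X = rng.normal(size=(n, k))
alpha0, alpha = 0.5, np.array([1.0, -1.0])
gamma = np.array([0.8, -0.4, 0.3])
Delta = rng.normal(size=(k, q))

y = nn_mean(X, alpha0, alpha, gamma, Delta) + 0.1 * rng.normal(size=n)
loss_at_truth = squared_error_loss(X, y, alpha0, alpha, gamma, Delta)

# With gamma = 0 the hidden layer drops out and the model is linear in x
linear_part = nn_mean(X, alpha0, alpha, np.zeros(q), Delta)
print(loss_at_truth)
```

Setting all \(\gamma_j=0\) recovers the linear model, which shows how the network nests the OLS specification.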

What changes in modern machine learning:

  • Many nonlinear basis functions, potentially many layers
  • Stochastic gradient methods rather than Gauss–Newton
  • Evaluated by predictive performance, not classical inference

From an econometric perspective, neural networks are flexible nonlinear approximators. They extend the NLLS logic rather than replacing it. Machine learning adds scale, computation, regularization, and a focus on prediction.

3. Ridge and LASSO

(penalized least squares for stability and variable selection)

Why Penalization?

OLS behaves poorly when:

  • Regressors are highly collinear: \(X'X\) is nearly singular and estimates are unstable.
  • The number of covariates is large relative to \(n\) — overfitting, poor out-of-sample performance.

The key idea: keep the linear predictor \(y_i \approx x_i'\beta\), but add a penalty that shrinks coefficients toward zero.

This introduces bias but can substantially reduce variance: the classic bias–variance tradeoff.

Two approaches depending on the penalty function:

Method   Penalty                           Variable selection?
Ridge    \(\sum_j \beta_j^2\) (\(L_2\))    No (smooth shrinkage)
LASSO    \(\sum_j |\beta_j|\) (\(L_1\))    Yes (exact zeros)

Ridge Regression

Ridge solves

\[\hat{\beta}_{Ridge} = \arg\min_{\beta} \left\{\frac{1}{n}\sum_{i=1}^n (y_i-x_i'\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2\right\}.\]

The quadratic penalty yields a closed-form solution:

\[\hat{\beta}_{Ridge} = (X'X + n\lambda I_p)^{-1}X'y.\]

Orthogonal design (\(X'X/n=I_p\)): each coefficient shrinks by the same factor,

\[\hat{\beta}_{j,Ridge} = \frac{1}{1+\lambda}\,\hat{\beta}_{j,OLS}.\]

  • When \(\lambda=0\): reduces to OLS.
  • As \(\lambda\uparrow\): all coefficients are pulled toward zero, but never exactly zero.
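The closed form and the orthogonal-design shrinkage factor can be verified numerically. In this sketch the design is constructed so that \(X'X/n=I_p\) holds exactly; the data, the value \(\lambda=0.5\), and the helper name `ridge` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 400, 4

# Build a design with X'X / n = I_p exactly (orthonormal columns scaled by sqrt(n))
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = np.sqrt(n) * Q
beta = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ beta + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge for the objective (1/n) * SSR + lam * sum(beta_j^2)."""
    n_obs, n_cols = X.shape
    return np.linalg.solve(X.T @ X + n_obs * lam * np.eye(n_cols), X.T @ y)

lam = 0.5
beta_ols = ridge(X, y, 0.0)      # lambda = 0 reduces to OLS
beta_ridge = ridge(X, y, lam)

print(beta_ols)
print(beta_ridge)
print(beta_ols / (1.0 + lam))    # orthogonal-design shrinkage formula
```

The last two printed vectors should coincide. With a non-orthogonal design the shrinkage differs across directions, so the uniform factor \(1/(1+\lambda)\) is specific to this special case.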

LASSO

LASSO solves

\[\hat{\beta}_{LASSO} = \arg\min_{\beta} \left\{\frac{1}{2n}\sum_{i=1}^n (y_i-x_i'\beta)^2 + \lambda\sum_{j=1}^p |\beta_j|\right\}.\]

The \(L_1\) penalty has a kink at zero — it can set coefficients exactly to zero: LASSO performs variable selection.

Orthogonal design — soft-thresholding solution:

\[\hat{\beta}_{j,LASSO} = \operatorname{sgn}(\hat{\beta}_{j,OLS})\max\!\left\{|\hat{\beta}_{j,OLS}|-\lambda,\;0\right\}.\]

  • If \(|\hat{\beta}_{j,OLS}| \leq \lambda\): coefficient is set to exactly zero.
  • If \(|\hat{\beta}_{j,OLS}| > \lambda\): coefficient survives but is shrunk by \(\lambda\).
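The soft-thresholding rule can be checked against a brute-force minimization of the per-coordinate objective, which under \(X'X/n=I_p\) equals \(\tfrac{1}{2}(\beta_j-\hat{\beta}_{j,OLS})^2+\lambda|\beta_j|\) up to a constant. The OLS values and \(\lambda=0.5\) below are made-up inputs for illustration.

```python
import numpy as np

def soft_threshold(b, lam):
    """Soft-thresholding operator: sgn(b) * max(|b| - lam, 0)."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def lasso_objective_1d(beta, b_ols, lam):
    """Per-coordinate LASSO objective under X'X/n = I_p (constant dropped)."""
    return 0.5 * (beta - b_ols) ** 2 + lam * np.abs(beta)

# Hypothetical OLS coefficients and penalty level
b_ols = np.array([1.5, -0.8, 0.3, -0.1])
lam = 0.5

b_lasso = soft_threshold(b_ols, lam)

# Brute-force check: minimize the objective over a fine grid, one coordinate at a time
grid = np.linspace(-3.0, 3.0, 60001)
b_grid = np.array([grid[np.argmin(lasso_objective_1d(grid, b, lam))] for b in b_ols])

print(b_lasso)
print(b_grid)
```

Coefficients with \(|\hat{\beta}_{j,OLS}|\leq\lambda\) (the last two here) are set to zero, while the others move toward zero by \(\lambda\).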

Elastic Net

When regressors are highly correlated, LASSO tends to keep one of them, chosen somewhat arbitrarily, and drop the others. Ridge shrinks correlated variables together but does not select.

The Elastic Net combines both penalties:

\[\hat{\beta}_{ElasticNet} = \arg\min_{\beta}\left\{\frac{1}{2n}\sum_{i=1}^n (y_i-x_i'\beta)^2 + \lambda\sum_{j=1}^p\!\left(\alpha|\beta_j| + \frac{1-\alpha}{2}\beta_j^2\right)\right\},\]

where \(\alpha\in[0,1]\) controls the mix:

  • \(\alpha=1\): pure LASSO.
  • \(\alpha=0\): pure Ridge.

Elastic Net produces sparse models like LASSO while maintaining the grouping effect of Ridge for correlated variables.
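As a worked special case (a derivation added here, not taken from the slides), under the orthogonal design \(X'X/n=I_p\) the Elastic Net objective separates by coordinate and has a closed-form solution that combines soft-thresholding with proportional shrinkage.

```latex
% Per-coordinate problem under X'X/n = I_p (constants dropped)
\[\min_{\beta_j}\; \tfrac{1}{2}\bigl(\beta_j-\hat{\beta}_{j,OLS}\bigr)^2
   + \lambda\alpha|\beta_j| + \tfrac{\lambda(1-\alpha)}{2}\beta_j^2\]
% Solution: soft-threshold at \lambda\alpha, then rescale by 1 + \lambda(1-\alpha)
\[\hat{\beta}_{j,ElasticNet}
   = \frac{\operatorname{sgn}(\hat{\beta}_{j,OLS})\max\!\left\{|\hat{\beta}_{j,OLS}|-\lambda\alpha,\;0\right\}}
          {1+\lambda(1-\alpha)}\]
```

Setting \(\alpha=1\) recovers the LASSO soft-thresholding rule, and \(\alpha=0\) gives the Ridge factor \(1/(1+\lambda)\).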

Bias–Variance Tradeoff and Tuning

The tuning parameter \(\lambda\) governs the bias–variance tradeoff:

  • Small \(\lambda\) → little shrinkage → close to OLS (low bias, high variance).
  • Large \(\lambda\) → strong shrinkage → low variance, higher bias.

Choosing \(\lambda\): not by hypothesis testing, but by \(K\)-fold cross-validation.

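  1. Partition the sample into \(K\) subsets (folds).
  2. For each fold and each candidate \(\lambda\), train on the other \(K-1\) folds and compute the prediction error on the held-out fold.
  3. Average the held-out errors across the \(K\) folds and select the \(\lambda\) (and \(\alpha\) for Elastic Net) that minimizes this out-of-sample MSE.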
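The cross-validation steps above can be sketched for Ridge using its closed form. The simulated data, the grid of candidate \(\lambda\) values, and \(K=5\) are assumptions; the regressors are generated with mean zero, so no intercept or standardization step is included here.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, K = 200, 20, 5

# Illustrative sparse data-generating process (assumed)
X = rng.normal(size=(n, p))
beta = np.concatenate([np.array([2.0, -1.5, 1.0]), np.zeros(p - 3)])
y = X @ beta + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge for the objective (1/n) * SSR + lam * sum(beta_j^2)."""
    n_obs, n_cols = X.shape
    return np.linalg.solve(X.T @ X + n_obs * lam * np.eye(n_cols), X.T @ y)

# Step 1: random partition of the indices into K folds
folds = np.array_split(rng.permutation(n), K)
lambdas = np.array([0.0, 0.001, 0.01, 0.1, 1.0, 10.0])

# Steps 2-3: held-out MSE for each lambda, averaged over folds
cv_mse = np.zeros(len(lambdas))
for j, lam in enumerate(lambdas):
    fold_mse = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        b = ridge(X[train_mask], y[train_mask], lam)
        resid = y[test_idx] - X[test_idx] @ b
        fold_mse.append(np.mean(resid ** 2))
    cv_mse[j] = np.mean(fold_mse)

lam_cv = lambdas[np.argmin(cv_mse)]
print(dict(zip(lambdas.tolist(), cv_mse.round(3).tolist())))
print(lam_cv)
```

In practice the selected \(\lambda\) depends on the random partition, so results are often reported for a fixed seed or averaged over repeated partitions.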

Asymptotic comments (fixed-dimensional case):

  • Ridge: consistent if \(\lambda_n\to 0\) at an appropriate rate.
  • LASSO: consistent for prediction and estimation under suitable rates; variable-selection consistency requires stronger conditions.

Ridge and LASSO are still linear models — the fitted predictor remains linear in the covariates. What changes is the estimation criterion. They extend OLS, they do not replace it.

4. Quantile Regression

(modeling conditional quantiles beyond the conditional mean)

Why Move Beyond the Conditional Mean?

OLS targets \(\mathbb{E}[y_i\mid x_i]\) — the conditional mean.

But many questions concern other parts of the distribution:

  • Do education programs help workers at the bottom of the wage distribution more than at the top?
  • Does a policy variable affect downside risk differently from upside outcomes?

Quantile regression models conditional quantiles:

\[Q_{\tau}(y_i\mid x_i) = x_i'\beta(\tau), \qquad \tau\in(0,1).\]

Instead of one coefficient vector, we estimate a family \(\beta(\tau)\) indexed by the quantile level.

\(\beta_j(\tau)\) measures the marginal effect of \(x_{ij}\) on the \(\tau\)-th conditional quantile of \(y_i\). If \(\beta_j(\tau)\) varies with \(\tau\), effects are heterogeneous across the outcome distribution.

From LAD to the Check-Loss Function

Median (\(\tau=0.5\)): instead of minimizing squared residuals (OLS), minimize absolute residuals — Least Absolute Deviations (LAD):

\[\hat{\beta}_{LAD} = \arg\min_{\beta}\sum_{i=1}^n |y_i - x_i'\beta|.\]

LAD is more robust to outliers because it penalizes errors linearly, not quadratically.

General \(\tau\): replace the absolute value with the asymmetric check-loss function:

\[\rho_{\tau}(u) = u\bigl(\tau - \mathbf{1}\{u<0\}\bigr) = \begin{cases}\tau u, & u\geq 0,\\ (\tau-1)u, & u<0.\end{cases}\]

The quantile-regression estimator solves:

\[\hat{\beta}(\tau) = \arg\min_{\beta}\sum_{i=1}^n \rho_{\tau}(y_i-x_i'\beta).\]
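In the intercept-only case, minimizing the check loss over a constant returns a sample quantile, which gives a quick sanity check of the definition. The sketch below assumes exponentially distributed data and \(\tau=0.9\); it searches over the observed data points, which is sufficient because the objective is piecewise linear with kinks at the observations.

```python
import numpy as np

def check_loss(u, tau):
    """Check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0).astype(float))

rng = np.random.default_rng(5)
y = rng.exponential(scale=1.0, size=1001)   # illustrative skewed outcome
tau = 0.9

# Intercept-only quantile regression: minimize sum_i rho_tau(y_i - c) over c
candidates = np.sort(y)
total_loss = np.array([check_loss(y - c, tau).sum() for c in candidates])
c_hat = candidates[np.argmin(total_loss)]

share_below = np.mean(y <= c_hat)

print(c_hat, np.quantile(y, tau))
print(share_below)
```

The minimizer should match the empirical 0.9 quantile, with about 90% of observations at or below it. With regressors, the same objective is usually minimized by linear programming rather than by direct search.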

Intuition Behind the Check-Loss

Why does asymmetric loss target quantiles?

Consider \(\tau=0.90\):

  • Under-predictions (\(u\geq 0\)): penalized by factor \(0.90\) — heavily penalized.
  • Over-predictions (\(u<0\)): penalized by factor \(0.10\) — lightly penalized.

This asymmetry forces the fitted line upward until 90% of observations lie below and 10% lie above — targeting the 90th conditional quantile.

First-order conditions (in subgradient form, since \(\rho_\tau\) is not everywhere differentiable):

\[\sum_{i=1}^n x_i\bigl(\tau - \mathbf{1}\{y_i - x_i'\beta < 0\}\bigr) = 0.\]

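At the optimum this condition holds up to terms from observations with zero residual: the weights \(\tau\) on nonnegative residuals and \(\tau-1\) on negative residuals balance, so when the model includes an intercept a share of roughly \(\tau\) of the residuals is negative.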

Asymptotic Normality of QR

Let \(Q_{\tau}(y_i\mid x_i)=x_i'\beta_0(\tau)\). Under standard regularity conditions:

\[\sqrt{n}\bigl(\hat{\beta}(\tau)-\beta_0(\tau)\bigr)\xrightarrow{\,d\,} \mathcal{N}(0,\,V_{\tau}),\]

where

\[V_{\tau} = \tau(1-\tau)\,D_{\tau}^{-1}\,Q\,D_{\tau}^{-1},\]

with

\[Q = \mathbb{E}[x_i x_i'], \qquad D_{\tau} = \mathbb{E}\bigl[f_{u_{\tau}\mid x}(0\mid x_i)\,x_i x_i'\bigr].\]

The term \(f_{u_\tau\mid x}(0\mid x_i)\) is the conditional density of the quantile-regression error \(u_{\tau,i}=y_i-x_i'\beta_0(\tau)\), evaluated at zero.


Note: Unlike OLS, the asymptotic variance of QR depends on the conditional density at the quantile of interest — not just second moments of the regressors. Standard OLS variance formulas do not apply.

Summary: Extensions of OLS

Method          Relaxes                            Keeps                      Key feature
GLS/FGLS        Homoskedasticity + independence    Linear conditional mean    Efficient under nonspherical errors
NLLS            Linearity in parameters            Least-squares criterion    Nonlinear regression functions
Ridge           OLS objective                      Linear predictor           Smooth shrinkage, no selection
LASSO           OLS objective                      Linear predictor           Shrinkage + variable selection
Quantile Reg.   Focus on conditional mean          Linear index               Effects at any quantile \(\tau\)

The common lesson: the linear model is not abandoned — it is adapted. Each extension keeps the core intuition of OLS while responding to a specific empirical challenge.

Closing

Questions?

Or via e-mail:
luis.chanci@usach.cl
luischanci@santotomas.cl