Magíster en Economía (Master's in Economics)
Teoría Econométrica (Econometric Theory)
Main idea: MLE chooses the parameter vector that makes the observed sample look most plausible under a specified probabilistic model.
In OLS, we estimate parameters by minimizing a quadratic loss: \[ \hat\beta_{OLS} = \operatorname*{arg\,min}_\beta (y-X\beta)'(y-X\beta). \]
MLE also solves an optimization problem, but from a different angle:
Instead of minimizing a loss, MLE chooses the parameter vector that makes the observed sample most likely under a fully specified probability model.
This makes MLE both:
There is a direct bridge from OLS to MLE.
In the Gaussian linear regression model, \[ y\mid X \sim \mathcal{N}(X\beta,\sigma^2 I_n), \] the MLE for \(\beta\) is exactly \[ \hat\beta_{MLE}=\hat\beta_{OLS}. \]
So OLS is not outside the likelihood framework. It is a special case of MLE under normal disturbances.
This is one reason MLE is a natural continuation of the OLS chapter.
Relative to OLS or GMM, MLE has a clear advantage:
But there is also a cost:
MLE requires stronger structure because we must specify a probability model for the data.
That is, we move from moment conditions to a full density or probability mass function.
Suppose we observe \[ \{w_i\}_{i=1}^n, \] where each \(w_i\) has density or pmf \[ f(w_i;\theta), \qquad \theta\in\Theta\subset\mathbb{R}^k. \]
If the observations are independent, then the joint density is \[ f(w_1,\ldots,w_n;\theta)=\prod_{i=1}^n f(w_i;\theta). \]
Definition — Likelihood Function
Given the observed sample, the likelihood function is \(\mathcal{L}_n(\theta)=\prod_{i=1}^n f(w_i;\theta)\).
Notice that \(\mathcal{L}_n\) is viewed as a function of \(\theta\), with the observed data held fixed.
Definition — MLE
The Maximum Likelihood Estimator is \[ \hat\theta_{MLE} = \operatorname*{arg\,max}_{\theta\in\Theta}\mathcal{L}_n(\theta). \]
Intuition: among all parameter values, choose the one under which the observed sample looks most plausible.
Different values of \(\theta\) imply different probability models for the data. MLE picks the best-fitting one.
In practice, we maximize \[ \ell_n(\theta)=\ln \mathcal{L}_n(\theta) = \sum_{i=1}^n \ln f(w_i;\theta). \]
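Why? The logarithm is strictly increasing, so \(\ell_n\) and \(\mathcal{L}_n\) have the same maximizer, and the log turns the product into a sum, which is easier to differentiate and to handle with laws of large numbers.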
Definition
\[ \hat\theta_{MLE} = \operatorname*{arg\,max}_{\theta\in\Theta}\ell_n(\theta). \]
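As a rough numerical check that the two criteria share a maximizer, the sketch below evaluates both on a grid for a \(\mathcal{N}(\mu,1)\) sample. The simulated data, the true mean of 1.5, and the grid are illustrative assumptions, not part of the course material.

```python
import math
import random

random.seed(0)
# Hypothetical sample: 50 draws from N(1.5, 1); the true mean is an assumption.
z = [random.gauss(1.5, 1.0) for _ in range(50)]


def normal_pdf(x, mu):
    """Density of N(mu, 1) evaluated at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def likelihood(mu):
    """L_n(mu): product of the individual densities."""
    out = 1.0
    for x in z:
        out *= normal_pdf(x, mu)
    return out


def log_likelihood(mu):
    """l_n(mu): sum of the individual log densities."""
    return sum(math.log(normal_pdf(x, mu)) for x in z)


grid = [i / 100 for i in range(0, 301)]  # candidate values 0.00, 0.01, ..., 3.00
argmax_lik = max(grid, key=likelihood)
argmax_loglik = max(grid, key=log_likelihood)
print(argmax_lik, argmax_loglik, sum(z) / len(z))
```

Both grid searches should select the same point, the grid value closest to the sample mean. With much larger samples the raw product can underflow to zero in floating point, which is a further practical reason to work with the sum of logs.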
Definition — Score and Hessian
The score is the gradient of the log-likelihood: \[ s_n(\theta) = \frac{\partial \ell_n(\theta)}{\partial \theta} = \sum_{i=1}^n s_i(\theta). \]
The Hessian is the matrix of second derivatives: \[ H_n(\theta) = \frac{\partial^2 \ell_n(\theta)}{\partial \theta\,\partial \theta'} = \sum_{i=1}^n H_i(\theta). \]
At an interior optimum: \(s_n(\hat\theta)=0\). This is the likelihood analogue of the OLS normal equations.
Definition — Fisher Information
The Fisher information in one observation is \[ \mathcal{I}(\theta)=\mathbb{E}[s_i(\theta)s_i(\theta)']. \]
Under standard regularity conditions, \[ \mathcal{I}(\theta) = -\mathbb{E}[H_i(\theta)]. \]
If the log-likelihood is sharply curved around the truth, the data are very informative about \(\theta\).
If the likelihood is flat, the data are less informative.
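The information equality can be checked by simulation. The sketch below uses an exponential model \(f(w;\lambda)=\lambda e^{-\lambda w}\), chosen here only for illustration; for this model \(s_i(\lambda)=1/\lambda-w_i\) and \(H_i(\lambda)=-1/\lambda^2\). The rate and sample size are assumed values.

```python
import random

random.seed(2)
lam = 2.0      # assumed true rate of the illustrative exponential model
n = 200_000    # number of simulated observations
w = [random.expovariate(lam) for _ in range(n)]

# For f(w; lam) = lam * exp(-lam * w):
#   score_i   = 1/lam - w_i
#   hessian_i = -1/lam**2   (does not depend on the data)
mean_score = sum(1.0 / lam - wi for wi in w) / n            # estimates E[s_i], near 0
outer_product = sum((1.0 / lam - wi) ** 2 for wi in w) / n  # estimates E[s_i^2]
minus_hessian = 1.0 / lam ** 2                              # equals -E[H_i] exactly

print(mean_score, outer_product, minus_hessian)
```

The simulated average of squared scores should be close to \(1/\lambda^2\), the negative expected Hessian, and the average score should be close to zero.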
Suppose \[ z_i \overset{iid}{\sim} \mathcal{N}(\mu,1), \qquad i=1,\ldots,n. \]
The density is \[ f(z_i;\mu) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(z_i-\mu)^2}{2}\right\}. \]
Hence the log-likelihood is \[ \ell_n(\mu) = -\frac{n}{2}\ln(2\pi) -\frac{1}{2}\sum_{i=1}^n (z_i-\mu)^2. \]
The score is \[ s_n(\mu)=\sum_{i=1}^n (z_i-\mu). \]
Set the score equal to zero: \[ \sum_{i=1}^n (z_i-\hat\mu)=0. \]
Then \(\hat\mu_{MLE}=\bar z\), and the Hessian is \(H_n(\mu)=-n\), so the information in one observation is \(\mathcal{I}(\mu)=1.\)
This simple example already shows the full MLE logic:
specify a density \(\rightarrow\) write the log-likelihood \(\rightarrow\) compute the score \(\rightarrow\) solve the FOC.
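The same steps can be traced numerically. In the sketch below (simulated data and the starting value are assumptions), a single Newton-Raphson step on the score reaches the sample mean, because the log-likelihood is exactly quadratic in \(\mu\).

```python
import random

random.seed(1)
# Hypothetical data: 200 draws from N(0.7, 1).
z = [random.gauss(0.7, 1.0) for _ in range(200)]
n = len(z)


def score(mu):
    """s_n(mu) = sum_i (z_i - mu)."""
    return sum(x - mu for x in z)


def hessian(mu):
    """H_n(mu) = -n, constant in mu for this model."""
    return -float(n)


mu0 = 5.0                              # arbitrary starting value
mu1 = mu0 - score(mu0) / hessian(mu0)  # one Newton-Raphson step
z_bar = sum(z) / n

print(mu1, z_bar, score(mu1))
```

In models where the log-likelihood is not quadratic, the same update would usually be iterated until the score is numerically zero.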
Now consider \[ y = X\beta + u, \qquad u\mid X \sim \mathcal{N}(0,\sigma^2 I_n). \]
Then \[ y\mid X \sim \mathcal{N}(X\beta,\sigma^2 I_n), \] and the log-likelihood is \[ \ell_n(\beta,\sigma^2) = -\frac{n}{2}\ln(2\pi) -\frac{n}{2}\ln(\sigma^2) -\frac{1}{2\sigma^2}(y-X\beta)'(y-X\beta). \]
For fixed \(\sigma^2\), maximizing the log-likelihood with respect to \(\beta\) is equivalent to minimizing the sum of squared residuals.
Differentiate with respect to \(\beta\): \[ \frac{\partial \ell_n(\beta,\sigma^2)}{\partial \beta} = \frac{1}{\sigma^2}X'(y-X\beta). \]
Set equal to zero: \[ X'X\hat\beta=X'y. \]
Therefore, \[ \hat\beta_{MLE}=(X'X)^{-1}X'y=\hat\beta_{OLS}. \]
For the variance parameter, \[ \hat\sigma^2_{MLE} = \frac{(y-X\hat\beta)'(y-X\hat\beta)}{n}. \]
This differs from the usual unbiased OLS estimator of \(\sigma^2\), which divides by \(n-k\) rather than \(n\).
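A minimal sketch of these results for a regression with an intercept and one slope (so \(k=2\)) follows. The data-generating values are hypothetical, and the closed-form expressions for the slope and intercept are the scalar version of \((X'X)^{-1}X'y\).

```python
import random

random.seed(3)
n, k = 100, 2  # k counts the intercept and one slope
x = [random.uniform(0.0, 10.0) for _ in range(n)]
# Hypothetical data-generating values: intercept 1.0, slope 0.5, sigma 2.0.
y = [1.0 + 0.5 * xi + random.gauss(0.0, 2.0) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx           # slope solving the normal equations X'X b = X'y
b0 = y_bar - b1 * x_bar  # intercept
resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
ssr = sum(e ** 2 for e in resid)

# Score in beta at the solution, up to the factor 1/sigma^2: X'(y - Xb).
score_beta = (sum(resid), sum(xi * e for xi, e in zip(x, resid)))

sigma2_mle = ssr / n             # maximum likelihood estimator of sigma^2
sigma2_unbiased = ssr / (n - k)  # degrees-of-freedom-corrected estimator

print(b0, b1, score_beta, sigma2_mle, sigma2_unbiased)
```

Both components of the score should be zero up to rounding error, and the two variance estimates differ exactly by the factor \((n-k)/n\).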
For MLE asymptotics, we need:
Regularity conditions for MLE
These conditions ensure that the expected log-likelihood is uniquely maximized at \(\theta_0\) and that derivatives behave well enough for Taylor expansions and probabilistic limits.
Let \[ Q_n(\theta)=\frac{1}{n}\ell_n(\theta). \]
Under a suitable LLN, \[ Q_n(\theta)\xrightarrow{p} Q(\theta)=\mathbb{E}[\ell_i(\theta)], \qquad \ell_i(\theta)=\ln f(w_i;\theta), \] uniformly over \(\theta\in\Theta\).
If the population criterion \(Q(\theta)\) is uniquely maximized at \(\theta_0\), then \[ \hat\theta_{MLE}\xrightarrow{p} \theta_0. \]
Consistency of MLE
Under standard regularity conditions, \[ \hat\theta_{MLE}\xrightarrow{\,p\,}\theta_0. \]
MLE is an extremum estimator: consistency comes from the sample criterion converging to a population criterion with a unique maximizer.
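Consistency can be illustrated by simulation. The sketch below again assumes an exponential model with rate \(\lambda_0\) (an illustrative choice), for which the first-order condition \(n/\lambda-\sum_i w_i=0\) gives \(\hat\lambda_{MLE}=n/\sum_i w_i\).

```python
import random

random.seed(4)
lam0 = 2.0  # assumed true rate


def mle_rate(n):
    """MLE of the exponential rate from a fresh simulated sample of size n."""
    w = [random.expovariate(lam0) for _ in range(n)]
    return n / sum(w)


for n in (10, 100, 1_000, 10_000, 100_000):
    est = mle_rate(n)
    print(n, est, abs(est - lam0))
```

The estimation error is random and need not fall at every step, but it should shrink toward zero as \(n\) grows.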
The score satisfies \[ s_n(\hat\theta)=0. \]
Expand around \(\theta_0\): \[ 0 = s_n(\theta_0) + H_n(\tilde\theta)(\hat\theta-\theta_0), \] where \(\tilde\theta\) lies between \(\hat\theta\) and \(\theta_0\).
Rearranging, \[ \sqrt{n}(\hat\theta-\theta_0) = - \left[\frac{1}{n}H_n(\tilde\theta)\right]^{-1} \left[\frac{1}{\sqrt{n}}s_n(\theta_0)\right]. \]
This isolates the two key objects: the normalized score evaluated at \(\theta_0\) and the average Hessian.
Because \[ s_n(\theta_0)=\sum_{i=1}^n s_i(\theta_0), \] and \[ \mathbb{E}[s_i(\theta_0)]=0, \] the CLT gives \[ \frac{1}{\sqrt{n}}s_n(\theta_0) \xrightarrow{d} \mathcal{N}(0,\mathcal{I}(\theta_0)). \]
Since \(\hat\theta\xrightarrow{p}\theta_0\) and \(\tilde\theta\) lies between \(\hat\theta\) and \(\theta_0\), we also have \(\tilde\theta\xrightarrow{p}\theta_0\), and by a uniform LLN \[ -\frac{1}{n}H_n(\tilde\theta)\xrightarrow{p} \mathcal{I}(\theta_0). \]
By Slutsky’s theorem, \[ \sqrt{n}(\hat\theta_{MLE}-\theta_0) \xrightarrow{d} \mathcal{N}\bigl(0,\mathcal{I}(\theta_0)^{-1}\bigr). \]
Asymptotic normality of MLE
Under standard regularity conditions, \[ \sqrt{n}(\hat\theta_{MLE}-\theta_0) \xrightarrow{d} \mathcal{N}\bigl(0,\mathcal{I}(\theta_0)^{-1}\bigr). \]
This is the foundation for standard errors, confidence intervals, and large-sample hypothesis tests in MLE.
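A small Monte Carlo experiment can make the limit result concrete. The sketch below assumes the exponential model with rate \(\lambda_0\), where \(\mathcal{I}(\lambda)=1/\lambda^2\) and hence the asymptotic variance is \(\lambda_0^2\); the sample size and number of replications are arbitrary choices.

```python
import math
import random
import statistics

random.seed(5)
lam0, n, reps = 2.0, 400, 2000  # assumed true rate, sample size, replications

draws = []
for _ in range(reps):
    total = sum(random.expovariate(lam0) for _ in range(n))
    lam_hat = n / total  # MLE of the rate in this replication
    draws.append(math.sqrt(n) * (lam_hat - lam0))

sim_mean = statistics.mean(draws)
sim_var = statistics.variance(draws)
asy_var = lam0 ** 2  # inverse Fisher information, since I(lam) = 1 / lam^2

# Share of draws within 1.96 asymptotic standard deviations of zero.
coverage = sum(abs(d) <= 1.96 * math.sqrt(asy_var) for d in draws) / reps

print(sim_mean, sim_var, asy_var, coverage)
```

If the normal approximation is adequate at this sample size, the simulated variance should be near \(\lambda_0^2\) and the coverage share near 0.95.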
The asymptotic variance depends on the unknown information matrix, so we need to estimate it.
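Three common approaches are the Hessian-based estimator \(A_n^{-1}\), the outer-product-of-gradients (OPG) estimator \(B_n^{-1}\), and the sandwich estimator \(A_n^{-1}B_nA_n^{-1}\),
where \[ A_n=-\frac{1}{n}H_n(\hat\theta), \qquad B_n=\frac{1}{n}\sum_{i=1}^n s_i(\hat\theta)s_i(\hat\theta)'. \]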
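To see \(A_n\) and \(B_n\) in use, the sketch below computes them for a scalar exponential model (an illustrative assumption, with simulated data) and forms the usual variance estimates built from them: the inverse of \(A_n\), the inverse of \(B_n\), and the sandwich \(A_n^{-1}B_nA_n^{-1}\).

```python
import math
import random

random.seed(6)
lam0, n = 2.0, 5000  # assumed true rate and sample size
w = [random.expovariate(lam0) for _ in range(n)]
lam_hat = n / sum(w)  # MLE of the rate

# Per-observation scores evaluated at the MLE: s_i = 1/lam_hat - w_i.
scores = [1.0 / lam_hat - wi for wi in w]

A_n = 1.0 / lam_hat ** 2              # minus the average Hessian at lam_hat
B_n = sum(s * s for s in scores) / n  # average outer product of scores

avar = {
    "hessian": 1.0 / A_n,
    "opg": 1.0 / B_n,
    "sandwich": B_n / A_n ** 2,  # scalar version of A^{-1} B A^{-1}
}
# Standard errors for lam_hat divide the asymptotic variance by n.
std_err = {name: math.sqrt(v / n) for name, v in avar.items()}

print(avar)
print(std_err)
```

Under correct specification all three estimates target \(\mathcal{I}(\theta_0)^{-1}\), here \(\lambda_0^2\), so they should be close to one another in large samples.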
For an unbiased scalar estimator \(\tilde\theta\), \[ \mathbb{V}(\tilde\theta)\geq \frac{1}{\mathcal{I}_n(\theta_0)}, \] where \[ \mathcal{I}_n(\theta_0)=n\mathcal{I}(\theta_0). \]
Intuition: the more sharply the likelihood responds to changes in \(\theta\), the more informative the data are, and the smaller the lower bound on variance.
This is the Cramér–Rao lower bound.
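As a worked check in the earlier example \(z_i \overset{iid}{\sim} \mathcal{N}(\mu,1)\): the information in one observation is \(\mathcal{I}(\mu)=1\), so \[ \mathcal{I}_n(\mu)=n, \qquad \frac{1}{\mathcal{I}_n(\mu)}=\frac{1}{n}. \] The sample mean is unbiased with \[ \mathbb{V}(\bar z)=\frac{1}{n^2}\sum_{i=1}^n \mathbb{V}(z_i)=\frac{1}{n}, \] so in this model \(\hat\mu_{MLE}=\bar z\) attains the bound exactly.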
MLE is not necessarily unbiased in finite samples, so the finite-sample Cramér–Rao result should be interpreted carefully.
What is true is that, under correct specification, \[ \sqrt{n}(\hat\theta_{MLE}-\theta_0) \xrightarrow{d} \mathcal{N}(0,\mathcal{I}(\theta_0)^{-1}), \] and this is the smallest asymptotic covariance matrix available among regular estimators.
Asymptotic efficiency of MLE
Under correct specification and standard regularity conditions, MLE attains the asymptotic information bound.
Invariance of MLE
If \(\gamma=g(\theta)\) and \(\hat\theta_{MLE}\) is the MLE of \(\theta\), then \[ \hat\gamma_{MLE}=g(\hat\theta_{MLE}). \]
Example:
if the MLE of \(\sigma^2\) is \(\hat\sigma^2_{MLE}\), then the MLE of \(\sigma\) is simply \[ \hat\sigma_{MLE}=\sqrt{\hat\sigma^2_{MLE}}. \]
You do not need to solve a new optimization problem for every smooth transformation of the parameters.
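Invariance can be checked numerically. The sketch below assumes a \(\mathcal{N}(0,\sigma^2)\) sample with known zero mean (a simplification made for illustration), where \(\hat\sigma^2_{MLE}=n^{-1}\sum_i z_i^2\), and compares \(\sqrt{\hat\sigma^2_{MLE}}\) with a direct grid maximization of the log-likelihood written in terms of \(\sigma\).

```python
import math
import random

random.seed(7)
# Hypothetical data: 500 draws from N(0, 1.5^2), mean known to be zero.
z = [random.gauss(0.0, 1.5) for _ in range(500)]
n = len(z)
sum_sq = sum(x * x for x in z)


def loglik_sigma(sigma):
    """Log-likelihood as a function of sigma, additive constants dropped."""
    return -n * math.log(sigma) - sum_sq / (2.0 * sigma ** 2)


sigma2_mle = sum_sq / n                       # closed-form MLE of sigma^2
sigma_via_invariance = math.sqrt(sigma2_mle)  # g(theta_hat) with g = sqrt

grid = [0.5 + i / 1000 for i in range(0, 2501)]  # sigma in 0.500, ..., 3.000
sigma_direct = max(grid, key=loglik_sigma)       # direct maximization over sigma

print(sigma_via_invariance, sigma_direct)
```

The two numbers should agree up to the grid resolution of 0.001.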
Suppose \[ H_0:R\theta=r. \]
The Wald statistic is \[ W = (R\hat\theta-r)' \bigl[ R\widehat{\operatorname{Var}}(\hat\theta)R' \bigr]^{-1} (R\hat\theta-r). \]
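Under \(H_0\), \[ W \xrightarrow{d} \chi_q^2, \] where \(q\) is the number of restrictions, that is, the number of rows of \(R\) (assumed to have full row rank).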
The Wald test asks whether the unrestricted estimate lies far from the null restriction once that distance is scaled by estimation uncertainty.
Let \(\hat\theta_u\) be the unrestricted MLE and \(\tilde\theta_r\) the restricted MLE.
Likelihood Ratio test \[ LR = 2\bigl[\ell_n(\hat\theta_u)-\ell_n(\tilde\theta_r)\bigr] \xrightarrow{d} \chi_q^2. \]
Score / LM test \[ LM = s_n(\tilde\theta_r)' \bigl[n\widehat{\mathcal{I}}(\tilde\theta_r)\bigr]^{-1} s_n(\tilde\theta_r) \xrightarrow{d} \chi_q^2. \]
All three tests are asymptotically equivalent under the null, but they differ in how much estimation they require.
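The three statistics can be compared in a one-parameter case. The sketch below assumes a Bernoulli model with success probability \(p\), for which \(\mathcal{I}(p)=1/[p(1-p)]\), and tests \(H_0: p=0.5\) on a hypothetical sample of 60 successes in 100 trials; the data and the null value are made up for illustration.

```python
import math

n, successes = 100, 60  # hypothetical data: 60 successes in 100 trials
p0 = 0.5                # value of p under the null hypothesis
p_hat = successes / n   # unrestricted MLE


def loglik(p):
    """Bernoulli log-likelihood for the observed counts."""
    return successes * math.log(p) + (n - successes) * math.log(1.0 - p)


def info(p):
    """Fisher information in one Bernoulli observation."""
    return 1.0 / (p * (1.0 - p))


# Wald: distance from the null, scaled by information at the unrestricted MLE.
wald = n * (p_hat - p0) ** 2 * info(p_hat)

# LR: twice the gap between unrestricted and restricted log-likelihoods.
lr = 2.0 * (loglik(p_hat) - loglik(p0))

# LM: score at the restricted estimate, scaled by information under the null.
score_p0 = successes / p0 - (n - successes) / (1.0 - p0)
lm = score_p0 ** 2 / (n * info(p0))

crit = 3.841  # approximate 5% critical value of chi-square with 1 df
print(wald, lr, lm, [stat > crit for stat in (wald, lr, lm)])
```

Here the Wald statistic uses only the unrestricted estimate, the LM statistic uses only the restricted one, and the LR statistic uses both. The three values are close but not identical in a finite sample.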
The real power of MLE appears when OLS is no longer natural.
Examples:
The logic remains the same: specify a probabilistic model, write the likelihood, optimize it, and use information-based asymptotic theory for inference.
MLE chooses the parameter vector that makes the observed sample most plausible under a specified probability model.
The score, Hessian, and Fisher information organize both estimation and inference.
In the Gaussian linear model, MLE connects directly back to OLS.
Under standard regularity conditions, \[ \sqrt{n}(\hat\theta_{MLE}-\theta_0) \xrightarrow{d} \mathcal{N}\bigl(0,\mathcal{I}(\theta_0)^{-1}\bigr). \]
Wald, LR, and Score tests emerge naturally in the likelihood framework.