We assume that the data $\bm{x}_n = \{x_1,x_2,\dots,x_n\}$ are generated from the true distribution $g(x)$, and consider the $p$-dimensional parametric model $\mathcal{F} = \{f(x; \theta)\ :\ \theta\in\Theta\subset\mathbb{R}^p \}$. Our goal is to evaluate the goodness of the model $f(x; \hat{\theta})$ fitted by the maximum likelihood estimator $\hat{\theta} \in \Theta$.

This amounts to evaluating the average performance of $f(z;\hat{\theta})$ under the true distribution $g(z)$ for new data $Z = z$. We use the KL divergence to measure the discrepancy between the two distributions:

\[\begin{aligned} D_{KL}[g(z) \| f(z; \hat{\theta})] &= \mathbb{E}_g\left[\ln\frac{g(Z)}{f(Z; \hat{\theta})}\right] \\ &= \mathbb{E}_g[\ln g(Z)] - \mathbb{E}_g[\ln f(Z; \hat{\theta})]. \end{aligned}\]

Since the first term $\mathbb{E}_g[\ln g(Z)]$ does not depend on the model, minimizing the KL divergence is equivalent to maximizing the expected log-likelihood: the larger the expected log-likelihood, the closer the model is to the true distribution.

\[\mathbb{E}_g[\ln f(Z; \hat{\theta})] = \int \ln f(z; \hat{\theta})dg(z).\]

Replacing the unknown distribution $g$ by the empirical distribution $\hat{g}$, we obtain

\[\begin{aligned} \mathbb{E}_{\hat{g}}\left[\ln f(Z; \hat{\theta})\right] &= \int \ln f(z; \hat{\theta})d\hat{g}(z) \\ &= \frac{1}{n}\sum^n_{i=1}\ln f(x_i; \hat{\theta}), \end{aligned}\]

and the log-likelihood of $f(z; \hat{\theta})$ is

\[\ell(\hat{\theta}) = \sum^n_{i=1} \ln f(x_i; \hat{\theta}).\]

Thus $\ell(\hat{\theta})/n$ is an estimator of $\mathbb{E}_g[\ln f(Z; \hat{\theta})]$, and $\ell(\hat{\theta})$ is an estimator of $n\mathbb{E}_g[\ln f(Z; \hat{\theta})]$.
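As a quick numerical illustration of this plug-in estimator (a minimal sketch, assuming a normal model fitted to a sample from a standard normal; all names are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=200)            # observed data x_1, ..., x_n from g = N(0, 1)

# MLE of the normal model N(mu, sigma^2): sample mean and sqrt of the 1/n variance
mu_hat, sigma_hat = x.mean(), x.std(ddof=0)

# Plug-in estimate of E_g[ln f(Z; theta_hat)]: the in-sample mean log-density
mean_loglik = norm.logpdf(x, mu_hat, sigma_hat).mean()
loglik = x.size * mean_loglik       # ell(theta_hat), the maximized log-likelihood
print(mean_loglik, loglik)
```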

Why do we need bias correction?

In practice, it is difficult to accurately capture the true distribution from a limited number of observations. We therefore construct several candidate statistical models from the observed data and select the one that best approximates the unknown distribution. Assume there are $m$ candidate models $\{f_j(z; \theta_j) : j=1,2,\dots,m \}$ with maximum likelihood estimators $\hat{\theta}_j$.

A natural idea is to compare the maximized log-likelihoods $\ell_j(\hat{\theta}_j)$ and select the model with the largest value. It is known, however, that this does not allow a fair comparison of models. The reason is that $\ell_j(\hat{\theta}_j)$ is a biased estimator of the expected log-likelihood of the maximum likelihood model, $n\mathbb{E}_g[\ln f_j(Z; \hat{\theta}_j)]$, and the magnitude of the bias depends on the dimension of the parameter vector. The bias arises because the same data are used both to estimate the parameters and to evaluate the goodness of fit of the estimated model.
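This optimism can be checked by simulation. The following sketch (again assuming the normal model and a standard normal $g$, so the model is correctly specified with $p = 2$ parameters) repeatedly fits the model and compares the maximized log-likelihood with $n\mathbb{E}_g[\ln f(Z; \hat{\theta})]$, which is available in closed form in this setting:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, n_rep = 100, 5000

gaps = []
for _ in range(n_rep):
    x = rng.normal(size=n)                       # sample from the true g = N(0, 1)
    mu, s2 = x.mean(), x.var(ddof=0)             # MLE of the normal model
    in_sample = norm.logpdf(x, mu, np.sqrt(s2)).sum()   # ell(theta_hat)
    # n * E_g[ln f(Z; theta_hat)] in closed form, since E_g[(Z - mu)^2] = 1 + mu^2
    expected = n * (-0.5 * np.log(2 * np.pi * s2) - (1.0 + mu**2) / (2.0 * s2))
    gaps.append(in_sample - expected)

print(np.mean(gaps))   # on average close to p = 2: ell(theta_hat) is optimistic
```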

Derivation of bias and information criterion

The maximum likelihood estimator $\hat{\theta}$ is the value of $\theta$ that maximizes the log-likelihood function $\ell(\theta) = \sum^n_{i=1}\ln f(X_i; \theta)$, or equivalently a solution of the following likelihood equation.

\[\frac{\partial \ell(\theta)}{\partial\theta} = \sum^n_{i=1}\frac{\partial}{\partial\theta}\ln f(X_i; \theta) = 0.\]

The expectation of the left-hand side with respect to the data-generating distribution is

\[\mathbb{E}_{g(x_n)}\left[\sum^n_{i=1}\frac{\partial}{\partial\theta}\ln f(X_i; \theta)\right] = n\mathbb{E}_{g(z)}\left[\frac{\partial}{\partial\theta}\ln f(Z; \theta)\right],\]

and we let $\theta_0$ denote the solution of the corresponding equation under the true distribution,

\[\mathbb{E}_{g(z)}\left[\frac{\partial}{\partial\theta}\ln f(Z; \theta)\right] = \int g(z)\frac{\partial}{\partial\theta}\ln f(z; \theta)dz = 0.\]

It can be shown that the maximum likelihood estimator $\hat{\theta}$ converges to $\theta_0$ in probability as $n\to\infty$.
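As a concrete illustration (not needed for the general argument), consider fitting the normal model $f(z; \theta) = N(z; \mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)^\top$ to data from an arbitrary $g$ with finite variance. Solving the equation above componentwise gives

\[\mu_0 = \mathbb{E}_g[Z], \qquad \sigma^2_0 = \mathbb{E}_g\left[(Z - \mu_0)^2\right],\]

that is, $\theta_0$ matches the mean and variance of $g$ even when $g$ itself is not normal. This example is used again below.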

From the above discussion, we evaluate the bias of $\ell(\hat{\theta})$ as an estimator of $n\mathbb{E}_{g(z)}[\ln f(Z; \hat{\theta})]$. Writing $\ln f(X_n; \theta) = \sum^n_{i=1}\ln f(X_i; \theta)$ for the log-likelihood of the whole sample, the bias is

\[b(g) = \mathbb{E}_{g(x_n)}\left[\ln f(X_n; \hat{\theta}(X_n)) - n\mathbb{E}_{g(z)}[\ln f(Z; \hat{\theta}(X_n))]\right].\]

We can decompose $b(g)$ as

\[\begin{aligned} & \mathbb{E}_{g(x_n)}\left[\ln f(X_n; \hat{\theta}(X_n)) - n\mathbb{E}_{g(z)}[\ln f(Z; \hat{\theta}(X_n))]\right] \\ &= \mathbb{E}_{g(x_n)}[\ln f(X_n; \hat{\theta}(X_n)) - \ln f(X_n; \theta_0)] \\ &\quad + \mathbb{E}_{g(x_n)}[\ln f(X_n; \theta_0) - n\mathbb{E}_{g(z)}[\ln f(Z; \theta_0)]] \\ &\quad + \mathbb{E}_{g(x_n)}[n\mathbb{E}_{g(z)}[\ln f(Z; \theta_0)] - n\mathbb{E}_{g(z)}[\ln f(Z; \hat{\theta}(X_n))]] \\ &= D_1 + D_2 + D_3. \end{aligned}\]

Next, we compute $D_1$, $D_2$ and $D_3$ as follows.

(i) Computation of $D_2$

\[\begin{aligned} D_2 &= \mathbb{E}_{g(x_n)}[\ln f(X_n; \theta_0) - n\mathbb{E}_{g(z)}[\ln f(Z; \theta_0)]] \\ &= \mathbb{E}_{g(x_n)}\left[\sum^n_{i=1}\ln f(X_i; \theta_0)\right] - n\mathbb{E}_{g(z)}[\ln f(Z; \theta_0)] \\ &= 0. \end{aligned}\]

(ii) Computation of $D_3$

First, let

\[\eta(\hat{\theta}) = \mathbb{E}_{g(z)}[\ln f(Z; \hat{\theta})].\]

From the Taylor expansion of $\eta(\hat{\theta})$ around $\theta_0$, we have

\[\begin{aligned} \eta(\hat{\theta}) = \eta(\theta_0) &+ \sum^p_{i=1}(\hat{\theta}_i - \theta^{(0)}_i)\frac{\partial\eta(\theta_0)}{\partial\theta_i} \\ &+ \frac{1}{2}\sum^p_{i=1}\sum^p_{j=1}(\hat{\theta}_i - \theta^{(0)}_i)(\hat{\theta}_j - \theta^{(0)}_j)\frac{\partial^2\eta(\theta_0)}{\partial\theta_i\partial\theta_j} + \cdots. \end{aligned}\]

Since $\theta_0$ is a solution of the equation above, we have

\[\frac{\partial\eta(\theta_0)}{\partial\theta_i} = \mathbb{E}_{g(z)}\left[\left.\frac{\partial}{\partial\theta_i}\ln f(Z; \theta)\right|_{\theta_0}\right] = 0, \quad i=1,2,\dots,p.\]

Then, neglecting the higher-order terms, we have

\[\eta(\hat{\theta}) \approx \eta(\theta_0) - \frac{1}{2}(\hat{\theta} - \theta_0)^\top J(\theta_0) (\hat{\theta} - \theta_0),\]

where the $p\times p$ matrix $J(\theta_0)$ is

\[J(\theta_0) = -\mathbb{E}_{g(z)}\left[\left.\frac{\partial^2\ln f(Z; \theta)}{\partial\theta\partial\theta^\top}\right|_{\theta_0}\right] = - \int g(z)\left.\frac{\partial^2\ln f(z; \theta)}{\partial\theta\partial\theta^\top}\right|_{\theta_0}dz,\]

and its $(a, b)$ element is

\[j_{ab} = -\mathbb{E}_{g(z)}\left[\left.\frac{\partial^2 \ln f(Z; \theta)}{\partial\theta_a\partial\theta_b}\right|_{\theta_0}\right] = -\int g(z)\left.\frac{\partial^2\ln f(z; \theta)}{\partial\theta_a\partial\theta_b}\right|_{\theta_0}dz.\]

Then, since $D_3$ is $n$ times the expectation of $\eta(\theta_0) - \eta(\hat{\theta})$ with respect to $g(x_n)$, we have

\[\begin{aligned} D_3 &= \mathbb{E}_{g(x_n)}\left[n\mathbb{E}_{g(z)}[\ln f(Z; \theta_0)] - n\mathbb{E}_{g(z)}[\ln f(Z; \hat{\theta})]\right]\\ &= \frac{n}{2}\mathbb{E}_{g(x_n)}\left[(\hat{\theta} - \theta_0)^\top J(\theta_0)(\hat{\theta} - \theta_0)\right] \\ &= \frac{n}{2}\mathbb{E}_{g(x_n)}\left[\mathrm{tr}(J(\theta_0)(\hat{\theta}-\theta_0)(\hat{\theta}-\theta_0)^\top)\right] \\ &= \frac{n}{2}\mathrm{tr}\left(J(\theta_0)\mathbb{E}_{g(x_n)}[(\hat{\theta} - \theta_0)(\hat{\theta} - \theta_0)^\top]\right). \end{aligned}\]

The asymptotic variance-covariance matrix of $\hat{\theta}$ is given as

\[\mathbb{E}_{g(x_n)}\left[(\hat{\theta} - \theta_0)(\hat{\theta} - \theta_0)^\top\right] = \frac{1}{n}J(\theta_0)^{-1}I(\theta_0)J(\theta_0)^{-1},\]

where $I(\theta_0)$ is the Fisher information matrix

\[\begin{aligned} I(\theta_0) &= \mathbb{E}_{g(z)}\left[\left.\frac{\partial\ln f(Z; \theta)}{\partial\theta}\frac{\partial\ln f(Z; \theta)}{\partial\theta^\top}\right|_{\theta_0}\right] \\ &= \int g(z)\left.\frac{\partial\ln f(z;\theta)}{\partial\theta}\frac{\partial\ln f(z; \theta)}{\partial\theta^\top}\right|_{\theta_0}dz. \end{aligned}\]

Then, we have

\[D_3 = \frac{1}{2}\mathrm{tr}(I(\theta_0)J(\theta_0)^{-1}).\]
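Before turning to $D_1$, note that in the normal-model illustration a direct calculation at $\theta_0 = (\mu_0, \sigma^2_0)^\top$ gives, with $\mu_3$ and $\mu_4$ the third and fourth central moments of $g$,

\[J(\theta_0) = \begin{pmatrix} \dfrac{1}{\sigma^2_0} & 0 \\ 0 & \dfrac{1}{2\sigma^4_0} \end{pmatrix}, \qquad I(\theta_0) = \begin{pmatrix} \dfrac{1}{\sigma^2_0} & \dfrac{\mu_3}{2\sigma^6_0} \\ \dfrac{\mu_3}{2\sigma^6_0} & \dfrac{\mu_4 - \sigma^4_0}{4\sigma^8_0} \end{pmatrix},\]

so that $\mathrm{tr}\left(I(\theta_0)J(\theta_0)^{-1}\right) = 1 + \dfrac{\mu_4 - \sigma^4_0}{2\sigma^4_0}$, which equals $p = 2$ exactly when $\mu_4 = 3\sigma^4_0$, as for a normal $g$.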

(iii) Computation of $D_1$

From the Taylor expansion of $\ell(\theta) = \ln f(X_n; \theta)$ around $\hat{\theta}$, we have

\[\ell(\theta) = \ell(\hat{\theta}) + (\theta - \hat{\theta})^\top\frac{\partial\ell(\hat{\theta})}{\partial\theta} + \frac{1}{2}(\theta - \hat{\theta})^\top\frac{\partial^2\ell(\hat{\theta})}{\partial\theta\partial\theta^\top}(\theta - \hat{\theta}) + \cdots.\]

Since $\hat{\theta}$ satisfies the likelihood equation, the first-order term vanishes, $\partial\ell(\hat{\theta})/\partial\theta = 0$. Moreover, from the fact that

\[\frac{1}{n}\frac{\partial^2\ell(\hat{\theta})}{\partial\theta\partial\theta^\top} = \frac{1}{n}\frac{\partial^2\ln f(X_n; \hat{\theta})}{\partial\theta\partial\theta^\top}\]

converges to $-J(\theta_0)$ in probability as $n\to\infty$, setting $\theta = \theta_0$ we obtain

\[\ell(\theta_0) - \ell(\hat{\theta}) \approx -\frac{n}{2}(\theta_0 - \hat{\theta})^\top J(\theta_0)(\theta_0 - \hat{\theta}).\]

Then, we have

\[\begin{aligned} D_1 &= \mathbb{E}_{g(x_n)}\left[\ln f(X_n; \hat{\theta}(X_n)) - \ln f(X_n; \theta_0)\right] \\ &= \frac{n}{2}\mathbb{E}_{g(x_n)}\left[(\theta_0 - \hat{\theta})^\top J(\theta_0)(\theta_0 - \hat{\theta})\right] \\ &= \frac{n}{2}\mathbb{E}_{g(x_n)}\left[\mathrm{tr}\left(J(\theta_0)(\theta_0 - \hat{\theta})(\theta_0 - \hat{\theta})^\top\right)\right] \\ &= \frac{n}{2}\mathrm{tr}\left(J(\theta_0)\mathbb{E}_{g(x_n)}[(\hat{\theta} - \theta_0)(\hat{\theta} - \theta_0)^\top]\right) \\ &= \frac{1}{2}\mathrm{tr}\left(I(\theta_0)J(\theta_0)^{-1}\right). \end{aligned}\]

From (i), (ii) and (iii), the bias is asymptotically obtained as

\[\begin{aligned} b(g) &= D_1 + D_2 + D_3 \\ &= \frac{1}{2}\mathrm{tr}\left(I(\theta_0)J(\theta_0)^{-1}\right) + 0 + \frac{1}{2}\mathrm{tr}\left(I(\theta_0)J(\theta_0)^{-1}\right) \\ &= \mathrm{tr}\left(I(\theta_0)J(\theta_0)^{-1}\right). \end{aligned}\]
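In the normal-model illustration this bias is $b(g) = 1 + (\mu_4 - \sigma^4_0)/(2\sigma^4_0)$, so it depends on the kurtosis of the unknown $g$. More generally, if the true distribution is contained in the model, so that $g(z) = f(z; \theta_0)$, then $I(\theta_0) = J(\theta_0)$ and

\[b(g) = \mathrm{tr}\left(I(\theta_0)J(\theta_0)^{-1}\right) = \mathrm{tr}(I_p) = p,\]

which is the bias correction of $p$ used by AIC.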

Finally, we consider estimating the bias $b(g)$. Let $\hat{I}$ and $\hat{J}$ be consistent estimators of $I(\theta_0)$ and $J(\theta_0)$, for example the empirical versions of the integrals above evaluated at $\hat{\theta}$. The estimator of the bias is

\[\hat{b} = \mathrm{tr}\left(\hat{I}\hat{J}^{-1}\right).\]

The bias correction of the log-likelihood using $\hat{b}$ yields the following information criterion.

\[\begin{aligned} \mathrm{TIC} &= -2\left(\sum^n_{i=1}\ln f(X_i; \hat{\theta}) - \mathrm{tr}(\hat{I}\hat{J}^{-1})\right) \\ &= -2\sum^n_{i=1}\ln f(X_i; \hat{\theta}) + 2\mathrm{tr}\left(\hat{I}\hat{J}^{-1}\right). \end{aligned}\]

This is called the Takeuchi Information Criterion (TIC).
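A minimal end-to-end sketch of the final formula, under the same assumed normal model as in the illustrations above: $\hat{I}$ and $\hat{J}$ are taken as the empirical averages of the per-observation score outer products and negative Hessians of $\ln f(x_i; \theta)$ evaluated at $\hat{\theta}$, and the data are drawn from a heavy-tailed $t$ distribution so that $\mathrm{tr}(\hat{I}\hat{J}^{-1})$ differs visibly from $p = 2$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.standard_t(df=5, size=500)   # data from a (non-normal) true distribution g

# MLE of the normal model N(mu, sigma^2) and the maximized log-likelihood
mu, s2 = x.mean(), x.var(ddof=0)
loglik = norm.logpdf(x, loc=mu, scale=np.sqrt(s2)).sum()

# Per-observation score of ln f(x; mu, sigma^2) evaluated at the MLE
r = x - mu
score = np.stack([r / s2,                                   # d/d mu
                  -0.5 / s2 + r**2 / (2 * s2**2)], axis=1)  # d/d sigma^2

# Per-observation Hessian of ln f(x; mu, sigma^2) evaluated at the MLE
H = np.empty((x.size, 2, 2))
H[:, 0, 0] = -1.0 / s2
H[:, 0, 1] = H[:, 1, 0] = -r / s2**2
H[:, 1, 1] = 0.5 / s2**2 - r**2 / s2**3

I_hat = score.T @ score / x.size     # empirical estimate of I(theta_0)
J_hat = -H.mean(axis=0)              # empirical estimate of J(theta_0)

b_hat = np.trace(I_hat @ np.linalg.inv(J_hat))
tic = -2 * loglik + 2 * b_hat
aic = -2 * loglik + 2 * 2            # AIC uses p = 2 for comparison
print(b_hat, tic, aic)
```

For a correctly specified model the two criteria nearly agree; here the heavy tails of $g$ typically inflate the estimated TIC penalty relative to AIC's fixed $p = 2$.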