# Chapter 4 Estimation

## 4.1 Point Estimation and the MLE

Point estimation is all about using one single number (statistic) to estimate a population parameter of interest. One question is how do we actually come up with those statistics?

In this section, we will illustrate the concept of maximum likelihood estimation. The idea of maximum likelihood estimation is to first assume our data come from a known family of distributions (e.g., normal) that contain parameters (e.g., mean $$\mu$$, variance $$\sigma^2$$). Then the maximum likelihood estimates (MLEs) of the parameters will be the parameter values that are most likely to have generated our data, where “most likely” is measured by the likelihood function.

To start, let’s create a simple data set. We will generate $$n=25$$ normal random variables with mean $$\mu=5$$ and variance $$\sigma^2=1$$.

x = rnorm(25, mean=5, sd=1)

Now we will pretend as if we do not know the mean parameter of the population distribution $$\mu$$ and then use the method of maximum likelihood to estimate $$\mu$$.

Remember that the MLE of $$\mu$$ is defined as $$\hat{\mu}^{\text{MLE}}=\arg\max f(x_1,\dotsc,x_n|\mu,\sigma^2)$$. That is, $$\hat{\mu}^{\text{MLE}}$$ is the value of $$\mu$$ that maximizes the likelihood function.

Once we have made the assumption that are our data are normally distributed, the likelihood function takes the form $f(x_1,\dotsc,x_n|\mu,\sigma^2)=\prod_{i=1}^n f(x_i|\mu,\sigma^2)=\prod_{i=1}^n{1\over\sqrt{2\pi\sigma^2}}\exp\left(-{1\over2\sigma^2}(x_i-\mu)^2\right)$ where the right-most term is the PDF of the normal distribution.

If we use calculus to formally maximize the likelihood function, we will find that $$\hat{\mu}^{\text{MLE}}=\bar{X}$$.

Since the MLE of $$\mu$$ is the sample mean, computing the MLE in R becomes straightforward.

mean(x)
## [1] 5.162

Therefore, the MLE of $$\mu$$ is about 5.16.

To help provide some intuition for how the maximum likelihood method works, let’s work through an example. Rather than try to work through the calculus to show that $$\hat{\mu}^{\text{MLE}}=\bar{x}=5.16$$, we can plot the likelihood function against many candidate values of $$\mu$$ and see where the curve is highest.

Since the likelihood function is a product of positive numbers (many of which can be very small), it’s more convenient and numerically stable to work with the log-likelihood function. $\log f(x_1,\dotsc,x_n|\mu,\sigma^2)=\log\left(\prod_{i=1}^n f(x_i|\mu,\sigma^2)\right)=\sum_{i=1}^n \log f(x_i|\mu,\sigma^2)$

To compute the log-likelihood function for each $$x_i$$, we can use the dnorm( ) function.

Let’s start by evaluating the log-likelihood for each observation. To do this we need to plug in a value for the mean (which we will fix to its true value $$\mu=5$$).

dnorm(x, mean=5, sd=1, log=TRUE)
##  [1] -1.2702 -1.0084 -0.9244 -1.4070 -2.8072 -0.9735 -2.2069 -1.6471
##  [9] -1.3350 -0.9204 -1.0977 -1.0532 -2.3580 -2.0379 -1.1200 -0.9289
## [17] -1.0902 -1.4054 -1.5403 -0.9302 -2.1302 -1.5254 -1.9105 -2.3638
## [25] -1.7857

The entire log-likelihood function is the sum of the individual log-likelihood evaluations. Therefore, the log-likelihood for $$\mu=5$$ can be found as follows.

sum(dnorm(x, mean=5, sd=1, log=TRUE))
## [1] -37.78

This number now represents a measure of the relative likelihood that our data x were generated from a normal distribution with mean 5. We can the imagine using the likelihood function as a way of finding “good” estimates of $$\mu$$. The “best” estimate of $$\mu$$ would correspond to the value that is most likely to have generated our data.

As a simple example, we could say: is it more likely that our data come from a distribution with mean $$\mu=5$$ or $$\mu=6$$?

To test this, we could evaluate the log likelihood function for $$\mu=6$$ and compare.

sum(dnorm(x, mean=6, sd=1, log=TRUE))
## [1] -46.22

Given that the likelihood function is higher (less negative) for $$\mu=5$$, we say that $$\hat{\mu}=5$$ is our estimate of $$\mu$$.

Since $$\mu$$ can be any real number, there is no reason to restrict plausible values of $$\mu$$ to the set of integers. Instead, we can lay out a grid of candidate parameter values of any desired level of granularity.

index = seq(3, 7, by=.01)
index
##   [1] 3.00 3.01 3.02 3.03 3.04 3.05 3.06 3.07 3.08 3.09 3.10 3.11 3.12 3.13
##  [15] 3.14 3.15 3.16 3.17 3.18 3.19 3.20 3.21 3.22 3.23 3.24 3.25 3.26 3.27
##  [29] 3.28 3.29 3.30 3.31 3.32 3.33 3.34 3.35 3.36 3.37 3.38 3.39 3.40 3.41
##  [43] 3.42 3.43 3.44 3.45 3.46 3.47 3.48 3.49 3.50 3.51 3.52 3.53 3.54 3.55
##  [57] 3.56 3.57 3.58 3.59 3.60 3.61 3.62 3.63 3.64 3.65 3.66 3.67 3.68 3.69
##  [71] 3.70 3.71 3.72 3.73 3.74 3.75 3.76 3.77 3.78 3.79 3.80 3.81 3.82 3.83
##  [85] 3.84 3.85 3.86 3.87 3.88 3.89 3.90 3.91 3.92 3.93 3.94 3.95 3.96 3.97
##  [99] 3.98 3.99 4.00 4.01 4.02 4.03 4.04 4.05 4.06 4.07 4.08 4.09 4.10 4.11
## [113] 4.12 4.13 4.14 4.15 4.16 4.17 4.18 4.19 4.20 4.21 4.22 4.23 4.24 4.25
## [127] 4.26 4.27 4.28 4.29 4.30 4.31 4.32 4.33 4.34 4.35 4.36 4.37 4.38 4.39
## [141] 4.40 4.41 4.42 4.43 4.44 4.45 4.46 4.47 4.48 4.49 4.50 4.51 4.52 4.53
## [155] 4.54 4.55 4.56 4.57 4.58 4.59 4.60 4.61 4.62 4.63 4.64 4.65 4.66 4.67
## [169] 4.68 4.69 4.70 4.71 4.72 4.73 4.74 4.75 4.76 4.77 4.78 4.79 4.80 4.81
## [183] 4.82 4.83 4.84 4.85 4.86 4.87 4.88 4.89 4.90 4.91 4.92 4.93 4.94 4.95
## [197] 4.96 4.97 4.98 4.99 5.00 5.01 5.02 5.03 5.04 5.05 5.06 5.07 5.08 5.09
## [211] 5.10 5.11 5.12 5.13 5.14 5.15 5.16 5.17 5.18 5.19 5.20 5.21 5.22 5.23
## [225] 5.24 5.25 5.26 5.27 5.28 5.29 5.30 5.31 5.32 5.33 5.34 5.35 5.36 5.37
## [239] 5.38 5.39 5.40 5.41 5.42 5.43 5.44 5.45 5.46 5.47 5.48 5.49 5.50 5.51
## [253] 5.52 5.53 5.54 5.55 5.56 5.57 5.58 5.59 5.60 5.61 5.62 5.63 5.64 5.65
## [267] 5.66 5.67 5.68 5.69 5.70 5.71 5.72 5.73 5.74 5.75 5.76 5.77 5.78 5.79
## [281] 5.80 5.81 5.82 5.83 5.84 5.85 5.86 5.87 5.88 5.89 5.90 5.91 5.92 5.93
## [295] 5.94 5.95 5.96 5.97 5.98 5.99 6.00 6.01 6.02 6.03 6.04 6.05 6.06 6.07
## [309] 6.08 6.09 6.10 6.11 6.12 6.13 6.14 6.15 6.16 6.17 6.18 6.19 6.20 6.21
## [323] 6.22 6.23 6.24 6.25 6.26 6.27 6.28 6.29 6.30 6.31 6.32 6.33 6.34 6.35
## [337] 6.36 6.37 6.38 6.39 6.40 6.41 6.42 6.43 6.44 6.45 6.46 6.47 6.48 6.49
## [351] 6.50 6.51 6.52 6.53 6.54 6.55 6.56 6.57 6.58 6.59 6.60 6.61 6.62 6.63
## [365] 6.64 6.65 6.66 6.67 6.68 6.69 6.70 6.71 6.72 6.73 6.74 6.75 6.76 6.77
## [379] 6.78 6.79 6.80 6.81 6.82 6.83 6.84 6.85 6.86 6.87 6.88 6.89 6.90 6.91
## [393] 6.92 6.93 6.94 6.95 6.96 6.97 6.98 6.99 7.00

We now want to step through each value in the index and evaluate the log-likelihood function. The MLE will be the value in the index that yields the highest log-likelihood function.

To do this, we will use a “for loop” in R.

R = length(index)
loglike = double(R)
for(r in 1:R){
loglike[r] = sum(dnorm(x, mean=index[r], sd=1, log=TRUE))
}

Here R counts the number of elements in the index variable and loglike is a vector of R elements that are initially set to be zero but are then filled in in each iteration of the for loop.

plot(index, loglike, type="l")
abline(v=index[which.max(loglike)], col=2, lty=2)

Once we have evaluated the log-likelihood function for each value in the index, we can plot the log-likelihood against the index.

Based on this plot, it looks like the value that maximizes the log-likehood functin is a little bit greater than 5. We formally check this in R.

index[which.max(loglike)]
## [1] 5.16

This matches the MLE that we initially computed using the sample mean.

mean(x)
## [1] 5.162

The only difference is the fact that our estimate using the plot is rounded to two decimal places because of the grid that we selected when creating the index variable.

## 4.2 Confidence Intervals

Ultimately, point estimates are insufficient because the estimators themselves are random variables and subject to variation across different samples. Rather than provide a single point estimate, it will be better to provide a range of plausible values for the parameter of interest. This is the idea of confidence intervals.

A confidence interval for any parameter $$\theta$$ will always take the form: $\hat{\theta}\pm \text{(critical value)}\times \text{SD}(\hat{\theta})$ where $$\hat{\theta}$$ is an estimator of $$\theta$$, the critical value is a quantile of a normal or t distribution, and $$\text{SD}(\hat{\theta})$$ is the standard deviation of the estimator.

### 4.2.1 Types of Confidence Intervals

We will consider four types of $$(1-\alpha)100\%$$ confidence intervals.

1. Confidence interval for mean $$\mu$$, data are normally distributed, variance $$\sigma^2$$ is known $\bar{x}\pm z_{\alpha/2}{\sigma\over\sqrt{n}}$

2. Confidence interval for mean $$\mu$$, data are normally distributed, variance $$\sigma^2$$ is unknown $\bar{x}\pm t_{n-1,\alpha/2}{s\over\sqrt{n}}$

3. Confidence interval for mean $$\mu$$, data are not normally distributed $\bar{x}\pm t_{n-1,\alpha/2}{s\over\sqrt{n}}$

4. Confidence interval for proportion $$p$$, data are binary (0s and 1s) $\hat{p}\pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})\over n}$

Notice that we can compute the critical values in R. If we need $$z_{\alpha/2}$$, then we must find the value of a standard normal random variable such that $$\alpha/2$$ percent of the area is to its right. This is can be found using the normal quantile function qnorm( )!

For example, if $$\alpha=0.05$$ let’s use qnorm( ) to find $$z_{0.025}$$.

qnorm(0.025, mean=0, sd=1, lower.tail=FALSE)
## [1] 1.96

A summary of the most commonly used normal critical values are provided below.

Confidence Level $$(1-\alpha)$$ Critical Value $$z_{\alpha/2}$$
99% 2.576
95% 1.96
90% 1.645

Similarly, if we need to compute $$t_{n-1,\alpha/2}$$ with $$\alpha=0.05$$ and our data have $$n=50$$ observations, we can use the qt( ) function.

qt(0.025, df=50-1, lower.tail=FALSE)
## [1] 2.01

Notice this critical value is larger than $$z_{0.025}$$ – this comes from the fact that the t-distribution has fatter tails than the normal distribution. But, it also turns out that a t distribution (which only has one parameter $$\nu$$ called the “degrees of freedom”) converges to a normal distribution as $$\nu\to\infty$$. We can formally check this using qt( ).

qt(0.025, df=50, lower.tail=FALSE)
## [1] 2.009
qt(0.025, df=100, lower.tail=FALSE)
## [1] 1.984
qt(0.025, df=1000, lower.tail=FALSE)
## [1] 1.962
qt(0.025, df=10000, lower.tail=FALSE)
## [1] 1.96

Notice how these values approach $$z_{0.025}\approx1.96$$ as the degrees of freedom parameter gets large.

### 4.2.2 Examples

PROBLEM 1: A wine importer needs to report the average percentage of alcohol in bottles of French wine. From experience with previous kinds of wine, the importer believes that alcolhol percentages are normally distributed and the population standard deviation is 1.2%. The importer randomly samples 60 bottles of the new wine and obtains a sample mean $$\bar{x}=9.3\%$$. Give a 90% confidence interval for the average percentage of alcohol in all bottles of the new wine.

Solution:

From the problem, we know the following. \begin{align*} n = 60\\ \sigma=1.2\\ \bar{x} =9.3\\ \alpha=0.10 \end{align*} We must first figure out which type of confidence interval to use. Notice that we are trying to estimate the average percentage of alcohol, so our parameter is a mean $$\mu$$. Moreover, we are told to assume that the data are normally distributed and the population standard deviation $$\sigma$$ is known. Thefore, our confidence interval will be of the form: $\bar{x}\pm z_{\alpha/2}{\sigma\over\sqrt{n}}.$ We can now define each object in R and construct the confidence interval.

n = 60
sigma = 1.2
xbar = 9.3
zalpha = qnorm(0.05, mean=0, sd=1, lower.tail=FALSE)
xbar - zalpha*sigma/sqrt(n)
## [1] 9.045
xbar + zalpha*sigma/sqrt(n)
## [1] 9.555

Therefore, we are 90% confident that the true average alcohol content in all new bottles of wine is between 9.05% and 9.55%.

PROBLEM 2: An economist wants to estimate the average amount in checking accounts at banks in a given region. A random sample of 100 accounts gives $$\bar{x}=£357.60$$ and $$s=£140.00$$. Give a 95% confidence interval for $$\mu$$, the average amount in any checking account at a bank in the given region.

Solution:

From the problem, we know the following. \begin{align*} n &= 100\\ \bar{x} &=357.60\\ s &= 140\\ \alpha&=0.5 \end{align*} Here we are not told whether the data are normally distributed. However, it won’t matter because we only have an estimate of $$\sigma$$ (remember that among the four types of confidence intervals we considered earlier, there are no differences between case II and III). Therefore, our confidence interval will be of the form: $\bar{x}\pm t_{n-1,\alpha/2}{s\over\sqrt{n}}.$ We can again define each object in R and construct the confidence interval.

n = 100
xbar = 357.60
s = 140
talpha = qt(0.025, df=n-1, lower.tail=FALSE)
xbar - talpha*s/sqrt(n)
## [1] 329.8
xbar + talpha*s/sqrt(n)
## [1] 385.4

Therefore, we are 95% confident that the true average account checking account value in the given region is between £329.82 and £385.38.

PROBLEM 3: The EuStockMarkets data set in R provides daily closing prices of four major European stock indices: Germany DAX (Ibis), Switzerland SMI, France CAC, and UK FTSE. Using this data set, produce a 99% confidence interval for the average closing price of the UK FTSE.

Solution:

First, let’s load in the data from R.

data(EuStockMarkets)
head(EuStockMarkets)
DAX SMI CAC FTSE
1629 1678 1773 2444
1614 1688 1750 2460
1607 1679 1718 2448
1621 1684 1708 2470
1618 1687 1723 2485
1611 1672 1714 2467

Now let’s pull the subset of data we care about (i.e., the UK FTSE column).

uk = EuStockMarkets[,4]

Notice that we are not told anything about the true distribution of the data. Therefore, our confidence interval will be of the form: $\bar{x}\pm t_{n-1,\alpha/2}{s\over\sqrt{n}}.$ Next, let’s compute each component necessary to construct the 99% confidence interval.

n = length(uk)
xbar = mean(uk)
s = sd(uk)
talpha = qt(0.005, df=n-1, lower.tail=FALSE)
xbar - talpha*s/sqrt(n)
## [1] 3507
xbar + talpha*s/sqrt(n)
## [1] 3624

Therefore, we are 99% confident that the true closing price for the UK FTSE index is between £3,507.25 and £3,624.04.

Finally, notice that we can use the t.test( ) function to perform the same analysis but with fewer steps.

t.test(uk, conf.level=0.99)
##
##  One Sample t-test
##
## data:  uk
## t = 157, df = 1859, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
##  3507 3624
## sample estimates:
## mean of x
##      3566