
An R TUTORIAL for Statistics Applications

Statistical Inference: Estimation of the Mean and Proportion

The most fundamental point and interval estimation process involves the estimation of a population mean. When the sample mean is used as a point estimate of the population mean, some error can be expected because a sample, a subset of the population, is used to compute the point estimate.


Rather than relying on a single point estimate, this chapter deals with interval estimation of a population parameter. Confidence intervals consist of a range of values (an interval) that act as good estimates of the unknown population parameter. However, the interval computed from a particular sample does not necessarily include the true value of the parameter. Since the observed data are random samples from the true population, the confidence interval obtained from the data is also random. If a corresponding hypothesis test is performed, the confidence level is the complement of the level of significance.

A point estimate is a single value given as the estimate of a population parameter that is of interest, for example, the mean of some quantity. An interval estimate specifies instead a range within which the parameter is estimated to lie. Interval estimates can be contrasted with point estimates. Confidence intervals are commonly reported in tables or graphs along with point estimates of the same parameters, to show the reliability of the estimates.

Suppose we want to estimate an actual population mean μ. As you know, we can only obtain \( \overline{x} , \) the mean of a sample randomly selected from the population of interest. We can use \( \overline{x} \) to find a range of values:

\[ \mbox{Lower value} < \mbox{population mean } \mu < \mbox{Upper value} \] that we can be really confident contains the population mean μ. The range of values is called a confidence interval. The general form of most confidence intervals is \[ \mbox{Sample estimate} \pm \mbox{margin of error} . \] That is, \[ \mbox{the lower limit } L \mbox{ of the interval } = \mbox{estimate} - \mbox{margin of error} , \] and \[ \mbox{the upper limit } U \mbox{ of the interval } = \mbox{estimate} + \mbox{margin of error} . \] Once we have obtained the interval, we can claim that we are really confident that the value of the population parameter is somewhere between the value of L and the value of U.

The number we add and subtract from the point estimate is called the margin of error. The question arises: What number should we subtract from and add to a point estimate to obtain an interval estimate? The answer to this question depends on two considerations:

  1. The standard deviation \( \sigma_{\overline{x}} \) of the sample mean, \( \overline{x} . \)
  2. The level of confidence to be attached to the interval.
First, the larger the standard deviation of \( \overline{x} , \) the greater is the number subtracted from and added to the point estimate. Thus, it is obvious that if the range over which \( \overline{x} \) can assume values is larger, then the interval constructed around \( \overline{x} \) must be wider to include μ.

Second, the quantity subtracted and added must be larger if we want to have a higher confidence in our interval. It is customary to attach a probabilistic statement to the interval estimation. This probabilistic statement is given by the confidence level. An interval constructed based on this confidence level is called a confidence interval. The confidence interval is given as

\[ \mbox{point estimate } \pm \mbox{margin of error} . \] The confidence level associated with a confidence interval states how much confidence we have that this interval contains the true population parameter. The confidence level is denoted by \( (1- \alpha )\,100\% , \) where α is the Greek letter alpha. When expressed as a probability, it is called the confidence coefficient and is denoted by 1 - α. The quantity α itself is called the significance level.

More generally and more precisely, we can say that 100(1-α)% of all samples of size n have means within the interval:

\[ \left[ \overline{x} - z_{\alpha /2} \cdot \frac{\sigma}{\sqrt{n}} , \ \overline{x} + z_{\alpha /2} \cdot \frac{\sigma}{\sqrt{n}} \right] , \] where \( z_{\alpha /2} \) is the value of the standard normal distribution that cuts off an area of α/2 in the upper tail, that is, \[ \frac{1}{\sqrt{2\pi}} \, \int_{-\infty}^{z_{\alpha /2}} {\text d}t \, e^{-t^2 /2} = 1 - \frac{\alpha}{2} . \] R has a special command to calculate z-values:
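The command is qnorm, which returns the quantile of a normal distribution for a given cumulative probability; its defaults are mean = 0 and sd = 1.

qnorm(0.025)                          # -1.959964
qnorm(0.975)                          #  1.959964
qnorm(0.025, mean = 2.5, sd = 4.75)   # -6.809829
qnorm(0.975, mean = 2.5, sd = 4.75)   # 11.809829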
As you see, for the standard normal distribution, R provides the numbers: \[ \frac{1}{\sqrt{2\pi}} \, \int_{-\infty}^{-1.959964} {\text d}t \, e^{-t^2 /2} = 0.025 \quad\mbox{and} \quad \frac{1}{\sqrt{2\pi}} \, \int_{-\infty}^{1.959964} {\text d}t \, e^{-t^2 /2} = 0.975 . \] For a normal distribution with mean μ = 2.5 and standard deviation σ = 4.75, we can obtain similar values: \[ \frac{1}{\sigma\,\sqrt{2\pi}} \, \int_{-\infty}^{-6.809829} {\text d}t \, e^{-(t- \mu )^2 /(2\,\sigma^2 )} = 0.025 \quad\mbox{and} \quad \frac{1}{\sigma\,\sqrt{2\pi}} \, \int_{-\infty}^{11.809829} {\text d}t \, e^{-(t- \mu )^2 /(2\,\sigma^2 )} = 0.975 . \] These numbers can be obtained from the standard normal distribution via the change of variable \[ x = z\,\sigma + \mu \qquad \Longleftrightarrow \qquad z = \frac{x-\mu}{\sigma} . \] Therefore, we get \[ 11.809829 = 1.959964 \times 4.75 + 2.5 \quad\mbox{and} \quad -6.809829 = -1.959964 \times 4.75 + 2.5. \]

Note that we assume that the standard deviation σ of the total population is known. The above z-interval procedure works reasonably well even when the variable is not normally distributed and the sample size is small or moderate, provided the variable is not too far from being normally distributed. Thus, we say that the z-interval procedure is robust to moderate violations of the normality assumption.

Example: Consider the weights of NHL hockey players during the 2017--2018 season, which have mean 173.5 lbs and standard deviation 13.39 lbs (according to the official NHL web data). Now we take a sample of five players from the Washington Capitals:

 Player  Weight (lbs)
 Alexander Ovechkin  236
 Nicklas Backstrom  214
 Jay Beagle  216
 Brooks Orpik  220
 Dmitry Orlov  209
We calculate the mean and the variance of the sample according to the formulas: \[ \overline{x} = \frac{236 + 214 + 216 + 220 + 209}{5} = 219, \qquad s^2 = \frac{1}{4} \, \sum_{k=1}^5 |X_k - \overline{x} |^2 = 106 . \] We check with R:
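A minimal check (weights is just our name for the data vector):

weights <- c(236, 214, 216, 220, 209)
mean(weights)   # 219
var(weights)    # 106
sd(weights)     # 10.29563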

Therefore, this sample has mean 219 and standard deviation 10.29563. We know that the sample mean \( \overline{x} = 219 \) and the sample variance \( s^2 \) are unbiased estimators of the population mean μ = 173.5 and the population variance \( \sigma^2 = 13.39^2 \approx 179.2921 . \) However, the sample standard deviation s is a biased estimator of the corresponding population parameter (in our case, the standard deviation of the population).

Now we take another sample from Boston Bruins:

 Player  Weight (lbs)
 Brad Marchand  181
 Patrice Bergeron  195
 David Pastrňák  188
 Torey Krug  186
 Brandon Carlo  208
This sample of five players gives the mean 191.6 and standard deviation 10.45466.
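The same check can be run for this sample (bruins is again just our name for the vector):

bruins <- c(181, 195, 188, 186, 208)
mean(bruins)   # 191.6
sd(bruins)     # 10.45466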

You can find the confidence interval using R. However, you first need to install a few packages (the last one will be used later for proportions).

install.packages("Rmisc", lib= "/data/Rpackages/")
install.packages("lattice", lib= "/data/Rpackages/")
install(plyr)
install.packages("PropCIs", lib= "/data/Rpackages/")
So we get the 95% interval for the mean to be [206.2163 , 231.7837], which does not contain the population mean. ■

As another illustration, consider a sample of 24 lizard tail lengths. If we use the t.test command listing only the data name, we get a 95% confidence interval for the mean along with the significance test:

lizard = c(6.2, 6.6, 7.1, 7.4, 7.6, 7.9, 8, 8.3, 8.4, 8.5, 8.6,
           8.8, 8.8, 9.1, 9.2, 9.4, 9.4, 9.7, 9.9, 10.2, 10.4, 10.8,
           11.3, 11.9)
t.test(lizard)

To see what "95% confidence" means in practice, we can repeatedly draw samples of the same size from a normal population with mean 9 and plot the resulting confidence intervals; roughly 95 of the 100 intervals should contain the true mean:

n.draw = 100                 # number of simulated samples
mu = 9                       # true population mean
n = 24                       # sample size
SD = sd(lizard)
draws = matrix(rnorm(n.draw * n, mu, SD), n)
get.conf.int = function(x) t.test(x)$conf.int
conf.int = apply(draws, 2, get.conf.int)
# count how many of the intervals cover the true mean
sum(conf.int[1, ] <= mu & conf.int[2, ] >= mu)
plot(range(conf.int), c(0, 1 + n.draw), type = "n",
     xlab = "mean tail length", ylab = "sample run")
for (i in 1:n.draw) lines(conf.int[, i], rep(i, 2), lwd = 2)
abline(v = 9, lwd = 2, lty = 2)

The sample variance is calculated according to the formula

\[ s^2 = \frac{1}{n-1} \, \sum_i \left( x_i - \overline{x} \right)^2 . \] The above formula can be slightly modified:

\[ s^2 = \frac{1}{n-1} \left( \sum_i x_i^2 - \frac{1}{n} \left( \sum_i x_i \right)^2 \right) , \] where n is the sample size. When using this formula, do not perform any rounding until the computation is complete; otherwise, substantial roundoff error can result.
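As a quick sanity check, both formulas produce the same value on the Capitals sample (here x is our name for the data vector):

x <- c(236, 214, 216, 220, 209)
n <- length(x)
sum((x - mean(x))^2) / (n - 1)        # 106, definition formula
(sum(x^2) - sum(x)^2 / n) / (n - 1)   # 106, shortcut formula
var(x)                                # 106, built-in equivalent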

Upon taking the square root of the right-hand side, we obtain the sample standard deviation, which is a biased estimator of the population standard deviation. On the other hand, \( s^2 \) is an unbiased estimator of the variance for an infinite population. However, it is not an unbiased estimator of the variance of a finite population. Recall that a statistic \( \hat{p} \) is an unbiased estimator of the parameter p if and only if its expected value equals the parameter: \( E \left[ \hat{p} \right] = p . \)

Theorem: If \( s^2 \) is the variance of a random sample from an infinite population with the finite variance \( \sigma^2 , \) then its expected value is equal to the variance of the population, that is, \( E\left[ s^2 \right] = \sigma^2 . \)

Proof: According to the definition of the expected value, we have \begin{align*} E\left[ s^2 \right] &= E \left[ \frac{1}{n-1} \cdot \sum_{i=1}^n \left( x_i - \overline{x} \right)^2 \right] \\ &= \frac{1}{n-1} \cdot E \left[ \sum_{i=1}^n \left\{ \left( x_i - \mu \right) - \left( \overline{x} - \mu \right) \right\}^2 \right] \\ &= \frac{1}{n-1} \cdot E \left[ \sum_{i=1}^n \left( x_i - \mu \right)^2 - 2 \left( \overline{x} - \mu \right) \sum_{i=1}^n \left( x_i - \mu \right) + n \left( \overline{x} - \mu \right)^2 \right] \\ &= \frac{1}{n-1} \cdot \left[ \sum_{i=1}^n E \left[ \left( x_i - \mu \right)^2 \right] - n \cdot E \left[ \left( \overline{x} - \mu \right)^2 \right] \right] , \end{align*} where we used \( \sum_{i=1}^n \left( x_i - \mu \right) = n \left( \overline{x} - \mu \right) \) to combine the middle term with the last one. Then, since \( E \left[ \left( x_i - \mu \right)^2 \right] = \sigma^2 \) and \( E \left[ \left( \overline{x} - \mu \right)^2 \right] = \frac{\sigma^2}{n} , \) it follows that \[ E\left[ s^2 \right] = \frac{1}{n-1} \cdot \left[ \sum_{i=1}^n \sigma^2 - n \cdot \frac{\sigma^2}{n} \right] = \frac{1}{n-1} \left( n\,\sigma^2 - \sigma^2 \right) = \sigma^2 . \qquad ■ \]
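The theorem is easy to confirm numerically. A minimal simulation (population and sample size chosen arbitrarily for illustration) shows that var(x) averages to σ² while sd(x) falls short of σ:

set.seed(1)   # for reproducibility
sims <- replicate(100000, {x <- rnorm(5, mean = 0, sd = 2); c(var(x), sd(x))})
mean(sims[1, ])   # close to sigma^2 = 4, confirming E[s^2] = sigma^2
mean(sims[2, ])   # noticeably below sigma = 2, since s is biased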
Pafnuty Chebyshev (1821--1894).

The standard deviation is a measure of variation---the more variation there is in a data set, the larger its standard deviation. Almost all the observations in any data set lie within three standard deviations to either side of the mean. A more precise version of the three-standard-deviations rule can be obtained from Chebyshev's rule:

For any quantitative data set and any real number k greater than or equal to 1, at least \( 1 - 1/k^2 \) of the observations lie within k standard deviations to either side of the mean, that is, between \( \overline{x} - k\,s \) and \( \overline{x} + k\,s . \)

Example: We return to the sample of five hockey players taken from the Washington Capitals and check Chebyshev's rule with k = 2, as sketched below.
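A minimal sketch (variable names are ours); with k = 2, Chebyshev's rule guarantees that at least \( 1 - 1/4 = 75\% \) of the observations lie within two sample standard deviations of the mean:

weights <- c(236, 214, 216, 220, 209)
xbar <- mean(weights)   # 219
s <- sd(weights)        # 10.29563
k <- 2
c(xbar - k * s, xbar + k * s)    # interval [198.4087, 239.5913]
# observed proportion of the sample inside the interval
mean(weights >= xbar - k * s & weights <= xbar + k * s)   # 1

All five weights fall within two standard deviations of the sample mean, consistent with the guaranteed lower bound of 75%. ■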
