
An R TUTORIAL for Statistics Applications

Sampling Distributions

The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. It may be considered as the distribution of the statistic for all possible samples from the same population of a given sample size.


A sampling distribution or finite-sample distribution is the probability distribution of a given statistic based on a random sample. Sampling distributions are important in statistics because they provide a major simplification en route to statistical inference. More specifically, they allow analytical considerations to be based on the sampling distribution of a statistic, rather than on the joint probability distribution of all the individual sample values.

Estimation is a process by which a numerical value or values are assigned to a population parameter based on the information collected from a sample.

In inferential statistics, μ is called the true population mean and p is called the true population proportion. There are many other parameters, such as the median, mode, variance, and standard deviation. In this section, we are concerned with these two parameters, μ and p.

The population distribution is the probability distribution derived from the information on all elements of a population. A sampling distribution can be thought of as a relative frequency distribution with a very large number of samples. More precisely, a relative frequency distribution approaches the sampling distribution as the number of samples approaches infinity. When a variable is discrete, the heights of the distribution are probabilities. When a variable is continuous, the class intervals have no width and the heights of the distribution are probability densities.

A sampling distribution is a probability distribution of a statistic obtained through a large number of samples drawn from a specific population. The sampling distribution of a given population is the distribution of frequencies of a range of different outcomes that could possibly occur for a statistic of a population.

For any population data set, there is only one value for its mean μ and only one value for its standard deviation σ. However, we cannot say the same about the sample mean, which is usually denoted by \( \overline{x} . \) We would expect different samples of the same size drawn from the same population to yield different values of the sample mean. Its value depends on the elements included in a particular sample. Consequently, the sample mean \( \overline{x} \) is a random variable. Therefore, like other random variables, the sample mean possesses a probability distribution, which is more commonly called the sampling distribution of \( \overline{x} .\) Other sample statistics, such as the median, mode, and standard deviation, also possess sampling distributions.

The probability distribution of \( \overline{x} \) is called its sampling distribution. It lists the various values that \( \overline{x} \) can assume and the probability of each value of \( \overline{x} .\)

Example: Suppose there are only eight students in an advanced statistics class and the midterm scores of these students are

\[ 70\quad 76 \quad 80 \quad 82 \quad 88 \quad 88 \quad 91 \quad 95 \] Let x denote the score of a student in this class. Each score except 88 has relative frequency 1/8, while the score 88 has relative frequency 1/4. These frequencies give the population probability distribution. Now we use R to calculate the population parameters.
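A minimal R sketch (the vector name scores is ours) computes these parameters; note that R's sd() uses the n-1 divisor, which is the convention behind the value quoted below.

scores <- c(70, 76, 80, 82, 88, 88, 91, 95)   # midterm scores of all eight students
mean(scores)    # population mean
sd(scores)      # standard deviation (R divides by n - 1)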
Therefore, the population mean is \( \mu = 83.75 \) and standard deviation is \( \sigma = 8.293715 . \)

Now we take a sample of four scores: 80, 82, 88, 95 and calculate sample statistics:
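For instance, in R (the vector name smpl is ours):

smpl <- c(80, 82, 88, 95)   # one particular sample of four scores
mean(smpl)    # sample mean
sd(smpl)      # sample standard deviation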

So for this particular sample, the mean is \( \overline{x} = 86.25 \) with standard deviation \( s \approx 6.751543 . \) We see that the sample mean exceeds the true mean by 2.5, and the sample standard deviation differs from the population value by about 1.54217.

If we take another sample: 70, 80, 91, 95, the results are quite different: \( \overline{x} = 84 \) and \( s \approx 11.28421 . \) So we see that the sample mean is a random variable whose value depends on the sample chosen. The total number of samples of size four at our disposal is \[ \binom{8}{4} = \frac{8^{\underline{4}}}{4!} = \frac{8 \cdot 7 \cdot 6 \cdot 5}{1 \cdot 2 \cdot 3 \cdot 4} = 70. \]

We check the answer with R:
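In R, the binomial coefficient is computed with choose():

choose(8, 4)    # number of samples of size four: 70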

From statistical theory we know that \( s^2 \) must have the smallest variance among all unbiased estimators of \( \sigma^2 , \) and so it is natural to wonder how much precision of estimation is lost by basing an estimate of \( \sigma^2 \) on the sample range R instead of \( s^2 . \)

Example: Let \( X_1, X_2 , \ldots , X_8 \) be a random sample from NORM(100,8). The R script below simulates the sample range R for m = 100,000 such samples in order to learn about the distribution of R.
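One way to carry out such a simulation is sketched below, assuming NORM(100,8) denotes a normal distribution with mean 100 and standard deviation 8; the seed and variable names are ours, and the estimates vary from run to run.

set.seed(1)      # any seed; chosen only for reproducibility
m <- 100000      # number of simulated samples
n <- 8           # sample size
# sample range = maximum - minimum of each simulated sample
R <- replicate(m, diff(range(rnorm(n, mean = 100, sd = 8))))
mean(R)          # estimate of E(R)
sd(R)            # estimate of the standard deviation of R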

The code reveals that the simulated ranges have mean \( E(R) \approx 24.60876 \) and standard deviation \( s \approx 6.370791 .\)

Usually, different samples selected from the same population will give different results because they contain different elements. The output obtained from any one sample will generally be different from the result obtained from the population. The difference between the value of a sample statistic obtained from a sample and the value of the corresponding population parameter obtained from the population is called the sampling error.

The word error implies that a mistake has been made, so the term sampling error makes it sound as if we made a mistake while sampling. This is not the case. Similarly, the term non-sampling error sounds as if it is the error we make from not sampling, and that is not the case either. However, these terms are used extensively in the statistics curriculum, so it is important that we clarify what they are about.

Sampling error is the error that arises in a data collection process as a result of taking a sample from a population rather than using the whole population. In the case of the mean,

\[ \mbox{Sampling error} = \overline{x} - \mu \] assuming that the sample is random and no nonsampling error has been made.

Sampling error is one of two reasons for the difference between an estimate of a population parameter and the true, but unknown, value of the population parameter. The other reason is nonsampling error. Even if a sampling process has no non-sampling errors, estimates from different random samples (of the same size) will vary from sample to sample, and each estimate is likely to differ from the true value of the population parameter.

Non-sampling error is the error that arises in a data collection process as a result of factors other than taking a sample. These errors occur because of human mistakes, and not chance. The errors that occur in the collection, recording, and tabulation of data are called nonsampling errors. There are many different types of non-sampling errors.

Nonsampling errors can be attributed to many sources, e.g., inability to obtain information about all cases in the sample, definitional difficulties, differences in the interpretation of questions, inability or unwillingness on the part of the respondents to provide correct information, inability to recall information, errors made in collection such as in recording or coding the data, errors made in processing the data, missing data, biases resulting from the differing recall periods caused by the interviewing pattern used.

Non-sampling errors can be further divided into coverage errors, measurement errors (respondent, interviewer, questionnaire, collection method…), non-response errors and processing errors. The coverage errors are generally not well measured for income and are usually inferred from exercises of data confrontation such as this. Non-response can be an issue in the case of surveys.

In administrative data – in particular the personal tax returns – the filing rates for specific populations may depend on a variety of factors (amount owed, financial activity during the year, personal interest, requirement for eligibility to support programs, etc.) and this could also result in differences in the estimates generated by the programs producing income data.

The mean and standard deviation calculated for the sampling distribution of \( \overline{x} \) are called the mean and standard deviation of \( \overline{x} . \) Actually, the mean and standard deviation of \( \overline{x} \) are, respectively, the mean and standard deviation of the means of all samples of the same size selected from a population. The standard deviation of \( \overline{x} \) is also called the standard error of \( \overline{x} . \)

The mean and standard deviation of the sampling distribution of \( \overline{x} \) are denoted by \( \mu_{\overline{x}} \) and \( \sigma_{\overline{x}} , \) respectively.
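These quantities can be illustrated by simulation; the sketch below reuses the eight midterm scores from the earlier example as the population and draws many samples of size four (the seed and variable names are ours).

scores <- c(70, 76, 80, 82, 88, 88, 91, 95)   # the population from the earlier example
set.seed(2)
xbar <- replicate(10000, mean(sample(scores, size = 4)))
mean(xbar)   # approximates the mean of the sampling distribution of x-bar
sd(xbar)     # approximates the standard error of x-bar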

Let X be a standard normal random variable (with mean 0 and standard deviation 1). Then \( X^2 \) has a special distribution, usually referred to as the chi-square distribution (with one degree of freedom). It is often denoted the \( \chi^2 \) distribution, where χ is the lowercase Greek letter chi. A random variable X has the chi-square distribution with ν degrees of freedom if its probability density is given by

\[ f(x) = \begin{cases} \dfrac{1}{2^{\nu /2} \Gamma (\nu /2)} \, x^{\nu /2 -1} \, e^{-x/2} , & \quad\mbox{for } x>0, \\ 0, & \quad\mbox{elsewhere}. \end{cases} \]

The mean and the variance of the chi-square distribution with ν degrees of freedom are ν and 2ν, respectively.

We build an example of the chi-square distribution from three standard normal variables.
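A possible sketch of this construction squares and sums three independent standard normal variables and compares the histogram with the chi-square density with three degrees of freedom (the seed and plotting choices are ours).

set.seed(3)
m <- 10000
y <- rnorm(m)^2 + rnorm(m)^2 + rnorm(m)^2   # sum of squares of three standard normals
hist(y, breaks = 60, freq = FALSE, main = "Chi-square with 3 df", xlab = "y")
curve(dchisq(x, df = 3), add = TRUE, lwd = 2)   # theoretical density for comparison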

You can compare chi-square distributions with three and four degrees of freedom:
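One possible comparison overlays the two densities (plotting choices are ours):

curve(dchisq(x, df = 3), from = 0, to = 15, lwd = 2, ylab = "density")
curve(dchisq(x, df = 4), add = TRUE, lwd = 2, lty = 2)
legend("topright", legend = c("df = 3", "df = 4"), lwd = 2, lty = c(1, 2))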

We can generate a chi-square distribution with, say, 20 degrees of freedom directly using the following R script:
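A minimal version of such a script (the seed and plotting choices are ours):

set.seed(4)
y <- rchisq(10000, df = 20)   # draw directly from the chi-square distribution
hist(y, breaks = 60, freq = FALSE, main = "Chi-square with 20 df", xlab = "y")
curve(dchisq(x, df = 20), add = TRUE, lwd = 2)
c(mean = mean(y), variance = var(y))   # should be close to 20 and 40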

From previous sections we know that, for random samples from a normal population with mean μ and variance \( \sigma^2 , \) the random variable \( \overline{x} \) has a normal distribution with mean μ and variance \( \frac{\sigma^2}{n} ; \) in other words,

\[ \dfrac{\overline{x} - \mu}{\sigma/\sqrt{n}} \] has the standard normal distribution. This is a very important result, but the major difficulty in applying it is that in most realistic applications the population standard deviation σ is unknown. This makes it necessary to replace σ with an estimate, usually with the value of the sample standard deviation s.
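A quick simulation check of this standardization, assuming samples of size n = 16 from a normal population with μ = 100 and σ = 8 (all of these choices are ours):

set.seed(5)
n <- 16; mu <- 100; sigma <- 8
z <- replicate(10000, (mean(rnorm(n, mu, sigma)) - mu) / (sigma / sqrt(n)))
c(mean = mean(z), sd = sd(z))   # should be close to 0 and 1
hist(z, breaks = 60, freq = FALSE, main = "Standardized sample mean", xlab = "z")
curve(dnorm(x), add = TRUE, lwd = 2)   # standard normal density for comparison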
If Y and Z are independent random variables, Y has a chi-square distribution with ν degrees of freedom, and Z has the standard normal distribution, then the distribution of \[ T = \frac{Z}{\sqrt{Y/\nu}} \] is given by \[ f(t) = \dfrac{\Gamma \left( \frac{\nu +1}{2} \right)}{\sqrt{\pi\nu} \,\Gamma (\nu /2)} \left( 1 + \frac{t^2}{\nu} \right)^{-(\nu +1)/2} \qquad\mbox{for } - \infty < t < \infty, \] and it is called the t distribution with ν degrees of freedom.
William Gosset.

The t distribution was introduced originally by the English statistician William Sealy Gosset (1876--1937), who published his scientific writings under the pen name "Student," since the brewery company for which he worked did not permit publication by employees. Thus, the t distribution is also known as the Student t distribution, or Student's t distribution.

Theorem: If \( \overline{x} \) and s are the mean and standard deviation of a random sample of size n from a normally distributed population with mean μ, then \[ T = \frac{\overline{x} - \mu}{s/\sqrt{n}} \] has a t distribution with \( n-1 \) degrees of freedom. ■

We build the t distribution from normal distributions.
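A sketch of this construction follows the theorem above, using samples of size n = 8 from a normal population with μ = 100 and σ = 8 (the sample size and parameters are our choices).

set.seed(6)
n <- 8; mu <- 100; sigma <- 8
tstat <- replicate(10000, {
  x <- rnorm(n, mu, sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))   # T = (x-bar - mu) / (s / sqrt(n))
})
hist(tstat, breaks = 80, freq = FALSE, main = "t distribution with 7 df", xlab = "t")
curve(dt(x, df = n - 1), add = TRUE, lwd = 2)   # theoretical t density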

Another distribution that plays an important role in connection with sampling from a normal population is the F distribution, named after the British statistician and geneticist Sir Ronald Aylmer Fisher (1890--1962). Fisher pioneered the application of statistical procedures to the design of scientific experiments.

Ronald Fisher.

We build the F distribution using its definition: the ratio of two independent chi-square variables, each divided by its degrees of freedom.
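A sketch of that construction is given below; the degrees of freedom 5 and 10, the seed, and the variable names are our choices.

set.seed(7)
m <- 10000
nu1 <- 5; nu2 <- 10
# F = (U / nu1) / (V / nu2), with U and V independent chi-square variables
f <- (rchisq(m, df = nu1) / nu1) / (rchisq(m, df = nu2) / nu2)
hist(f, breaks = 80, freq = FALSE, main = "F distribution with (5, 10) df", xlab = "f")
curve(df(x, df1 = nu1, df2 = nu2), add = TRUE, lwd = 2)   # theoretical F density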