
An Excel TUTORIAL for Introductory Statistics Applications

Part 1: Descriptive Statistics

This chapter covers basic information regarding the methods used by R for organizing and graphing data.

Email Vladimir Dobrushkin

Projects often involve so much data that it is difficult to analyze all of the data at once. We present some methods to manipulate data in order to make the data more manageable.

Subsection: Types of Data

Data can be split in several ways, depending on how they are collected and on the type of data collected.

Subsection: Modifying Data in Excel

Subsection: Sorting and Filtering Data

R contains useful features for sorting and filtering data so that one can more easily identify patterns.

For example, suppose we have a data frame with columns b and z that we would like to sort by column z (descending) and then by column b (ascending):

You can use the order() function directly, without resorting to add-on packages; the relevant trick appears right at the top of the example(order) code.

The same approach works for a matrix, except that with() cannot be used. Try M <- matrix(c(1,2,2,2,3,6,4,5), 4, 2, byrow=FALSE, dimnames=list(NULL, c("a","b"))) to create a matrix M, then use M[order(M[,"a"], -M[,"b"]), ] to order it on two columns (a ascending, b descending).
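As a hedged sketch (the original data frame is not reproduced here, so the values below are placeholders), sorting such a frame by z descending and then b ascending looks like this:

# hypothetical data frame with columns b and z
dd <- data.frame(b = c("Hi", "Med", "Hi", "Low"),
                 z = c(8, 1, 9, 9))
dd[with(dd, order(-z, b)), ]    # z descending, b ascending
dd[order(-dd$z, dd$b), ]        # the same ordering without with()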

To begin understanding how to properly sort data frames in R, we of course must first generate a data frame to manipulate.
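(A minimal sketch; the exact vectors used by the author are not shown, so the names and values below are placeholders.)

# run.R -- build a small data frame from three named vectors
dataframe <- data.frame(name = c("Dana", "Alex", "Casey", "Blair"),
                        y    = c(4, 2, 9, 7),
                        z    = c(3.1, 0.5, 2.2, 1.7),
                        stringsAsFactors = TRUE)  # make the name column a factor
print(dataframe)   # rows appear in the order they were entered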

Note: The spacing isn’t necessary, but it improves legibility.

Executing our run.R script outputs the list of vectors in our data frame as expected, in the order they were entered.

The Order Function

While perhaps not the easiest sorting method to type out in terms of syntax, the order function is the one most readily available in every installation of R, since it is part of the base package.

The order function accepts a number of arguments, but at the simplest level the first argument must be a sequence of values or logical vectors.

RStudio includes a data viewer that allows you to look inside data frames and other rectangular data structures. The viewer also includes some simple exploratory data analysis (EDA) features that can help you understand the data as you manipulate it with R. You can invoke the viewer in a console by calling the View function on the data frame you want to look at. For instance, to view the built-in iris dataset, run these commands:
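(A minimal sketch; View() opens the spreadsheet-like viewer only when run inside RStudio.)

# open the built-in iris data frame in the data viewer
library(datasets)   # attached by default in most R sessions
View(iris)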

Sorting a Data Frame by Vector Name

As you might expect, you can sort by any column just by clicking on the column. Click on a column that’s already sorted to reverse the sort direction.

For example, we can use order() to simply sort a vector of five randomly ordered numbers with this script:
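(A sketch of such a script; the five numbers are placeholders.)

# five randomly ordered numbers
v <- c(7, 2, 9, 1, 5)
print(v)            # the original, unordered vector
print(v[order(v)])  # the same values in ascending order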

Executing the script, we see the initial output of the unordered vector, followed by the now ordered list afterward.

With the order() function in our tool belt, we’ll start sorting our data frame by passing in the vector names within the data frame.

For example, using our previously generated dataframe object, we can sort by the vector z by adding the following code to our script:
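(A sketch, reusing the placeholder dataframe created above.)

# reorder the rows of dataframe by the z column, ascending
dataframe[with(dataframe, order(z)), ]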

What we’re effectively doing is calling our original dataframe object, and passing in the new index order that we’d like to have. This index order is generated using the with() function, which effectively creates a new environment using the passed in data in the first argument along with an expression for evaluating that data in the second argument.

Thus, we’re reevaluating the dataframe data using the order() function, and we want to order based on the z vector within that data frame. This returns a new index order for the data frame values, which is then finally evaluated within the [brackets] of dataframe[], outputting our new ordered result.

Consequently, we see our original unordered output, followed by a second output with the data sorted by column z.

Sorting by Column Index

Similar to the above method, it’s also possible to sort based on the numeric index of a column in the data frame, rather than the specific name.

Instead of using the with() function, we can pass a column of the data frame directly to order(). We indicate that we want to sort by the column with index 1 by using the dataframe[,1] syntax, which causes R to return the values (for a factor, the levels) of that first column. In other words, just as when we passed in the z vector name above, order() sorts based on the vector of values that are in the column with index 1:
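(A sketch, again with the placeholder dataframe.)

# sort the rows by the values in the column with index 1
dataframe[order(dataframe[, 1]), ]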

As expected, we get our normal output followed by the sorted output in the first column:

Sorting by Multiple Columns

In some cases, it may be desired to sort by multiple columns. Thankfully, doing so is very simple with the previously described methods.

To sort multiple columns using vector names, simply add additional arguments to the order() function call as before:
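(A sketch with the placeholder column names used above.)

# sort by z first, breaking ties by y
dataframe[with(dataframe, order(z, y)), ]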

Similarly, to sort by multiple columns based on column index, add additional arguments to order() with differing indices:
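(A sketch using numeric column indices.)

# sort by column 3 first, then by column 2
dataframe[order(dataframe[, 3], dataframe[, 2]), ]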

How to sort in decreasing order

Just like sort(), the order() function also takes an argument called decreasing. For example, to sort some.states in decreasing order of population:
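(A sketch; the actual some.states data are not reproduced here, so the rows below are placeholders with made-up Population figures.)

# hypothetical stand-in for the some.states data frame
some.states <- data.frame(Region     = c("South", "West", "West"),
                          Population = c(4780, 710, 6392),
                          row.names  = c("Alabama", "Alaska", "Arizona"))
# sort the rows by Population, largest first
some.states[order(some.states$Population, decreasing = TRUE), ]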

Suppose we want to sort a vector, matrix, or data frame.

The viewer also refreshes automatically when the underlying data change. If, say, a data frame of trees is open in the viewer and you edit the first tree's age at the console, you’ll see the age of the first tree change from 118 to 120 in the viewer.

This auto-refreshing feature has some prerequisites, so if it doesn’t seem to be working:

  • You must call View() on a variable directly. If, for instance, you call View(as.data.frame(foo)) or View(rbind(foo, bar)), you’re invoking View() on a new object created by evaluating your expression, and while that object contains data, it’s just a copy and won’t update when foo and bar do.

The number of rows the viewer can display is effectively unbounded, and large numbers of rows won’t slow down the interface. It uses the DataTables JavaScript library to virtualize scrolling, so only a few hundred rows are actually loaded at a time.

While rows are unbounded, columns are capped at 100. It’s not currently possible to virtualize columns in the same way as rows, and large numbers of columns cause the interface to slow significantly.

Finally, while we’ve made every effort to keep things speedy, very large amounts of data may cause sluggishness, especially when a sort or filter is applied, as this requires R to fully scan the frame. If you’re working with large frames, try applying filters to reduce it to the subset you’re interested in to improve performance.

Data frames

To sort a data frame on one or more columns, you can use the arrange function from the plyr package, or use R’s built-in functions. The arrange function is easier to use, but it does require the external package to be installed.
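A brief sketch of both approaches, using a made-up data frame df with size and weight columns (plyr is assumed to be installed):

# install.packages("plyr")   # only needed once
library(plyr)
df <- data.frame(size   = c("small", "large", "large", "small"),
                 weight = c(55, 85, 75, 42),
                 stringsAsFactors = TRUE)   # size becomes a factor, as noted below
arrange(df, size, weight)          # plyr: sort by size, then by weight
df[order(df$size, df$weight), ]    # base R equivalent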

Note that the size column is a factor and is sorted by the order of the factor levels. In this case, the levels were automatically assigned alphabetically (when creating the data frame), so large is first and small is last.

Reverse sort

The overall order of the sort can be reversed with the argument decreasing=TRUE.

To reverse the direction of a particular column, the method depends on the data type (a short sketch follows this list):

  • Numbers: put a - in front of the variable name, e.g. df[ order(-df$weight), ].
  • Factors: convert to integer and put a - in front of the variable name, e.g. df[ order(-xtfrm(df$size)), ].
  • Characters: there isn’t a simple way to do this. One method is to convert to a factor first and then sort as above.
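A brief sketch of these cases, reusing the made-up df frame from above:

df[order(-df$weight), ]         # numeric column, descending
df[order(-xtfrm(df$size)), ]    # factor column, descending via xtfrm()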

-------------------------------------------------------
Further reading:
sort() documentation: http://sites.stat.psu.edu/~drh20/R/html/base/html/sort.html
Conditional formatting in data frames (condformat): https://www.rdocumentation.org/packages/condformat/versions/0.7.0
A properly formatted version of the example below: http://zeehio.github.io/condformat

R-script:
---------------------------------------------------
data(iris)
library(condformat)
condformat(iris[c(1:5, 70:75, 120:125), ]) %>%
  rule_fill_discrete(Species) %>%
  rule_fill_discrete(c(Sepal.Width, Sepal.Length),
                     expression = Sepal.Width > Sepal.Length - 2.25,
                     colours = c("TRUE" = "#7D00FF")) %>%
  rule_fill_gradient2(Petal.Length) %>%
  rule_css(Sepal.Length,
           expression = ifelse(Species == "setosa", "bold", "regular"),
           css_field = "font-weight") %>%
  rule_css(Sepal.Length,
           expression = ifelse(Species == "setosa", "yellow", "black"),
           css_field = "color")
---------------------------------------------------

More examples on styling cells, rows, and tables: https://rstudio.github.io/DT/010-style.html

Filtering data:
https://stackoverflow.com/questions/1686569/filter-data-frame-rows-by-a-logical-condition
http://www.rexamples.com/11/Filtering%20data
https://www.rdocumentation.org/packages/dplyr/versions/0.7.3/topics/filter
https://blog.exploratory.io/filter-data-with-dplyr-76cf5f1a258e
https://stats.stackexchange.com/questions/6187/filtering-a-dataframe
--------------------------------------------------------



The most important aspect of studying the distribution of a sample of measurements is locating the position of a central value about which the measurements are distributed. The arithmetic mean (average) of a set of n measurements \( X_1 , X_2 , \ldots , X_n \) is given by the formula

\[ \overline{X} = \frac{1}{n}\, \sum_{i=1}^n X_i = \frac{X_1 + X_2 + \cdots + X_n}{n} . \]

The mean provides a measure of central location for the data. If the data are for a sample (typically the case), the mean is denoted by \( \overline{X} . \) The sample mean is a point estimate of the (typically unknown) population mean for the variable of interest. If the data for the entire population are available, the population mean is computed in the same manner, but denoted either by \( E[X] , \) or by the Greek letter μ.

If the data are organized in a frequency distribution table, then we can calculate the mean by the formula

\[ \overline{X} = \frac{1}{n}\, \sum_{i=1}^k n_i X_i , \qquad n = n_1 + n_2 + \cdots + n_k , \]
where \( n_1 , n_2 , \ldots , n_k \) are the frequencies of the distinct values \( X_1 , X_2 , \ldots , X_k \) of the variable.
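As a quick sketch in R, the same calculation can be checked with made-up values and frequencies (weighted.mean() is the built-in helper):

x <- c(2, 5, 7)        # distinct values of the variable
f <- c(3, 4, 3)        # their frequencies; n = sum(f) = 10
sum(f * x) / sum(f)    # mean computed from the frequency table: 4.7
weighted.mean(x, f)    # the same result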

Elementary properties of the arithmetic mean:

  • the sum of deviations of the values from the mean equals zero:
    \[ \sum_{i=1}^n \left( X_i - \overline{X} \right) = 0 ; \]
  • if the variable is constant, then the mean equals this constant:
    \[ \frac{1}{n}\, \sum_{i=1}^n c = c ; \]
  • if we add a constant c to the values of the variable, then the mean shifts by the same constant:
    \[ \frac{1}{n}\, \sum_{i=1}^n \left( X_i + c \right) = \overline{X} + c ; \]
  • if we multiply the values of the variable by a constant c, then the mean is multiplied by c:
    \[ \frac{1}{n}\, \sum_{i=1}^n c\cdot X_i = c \cdot \overline{X} . \]

The harmonic mean of a set of n measurements \( X_1 , X_2 , \ldots , X_n \) is defined by the formula

\[ \overline{X}_H = \frac{n}{\sum_{i=1}^n X_i^{-1}} . \]
In certain situations, especially many situations involving rates and ratios, the harmonic mean provides the truest average.
The geometric mean of a set of n measurements \( X_1 , X_2 , \ldots , X_n \) is defined by the formula

\[ \overline{X}_G = \left( X_1 \cdot X_2 \cdot \cdots \cdot X_n \right)^{1/n} = \sqrt[n]{X_1 \cdot X_2 \cdot \cdots \cdot X_n} . \]

The geometric mean may be more appropriate than the arithmetic mean for describing percentage growth.

Suppose an apple tree yields 50 apples one year, then 60, 80, and 95 in the following years, so the growth is 20 %, 33.33 %, and 18.75 % in each of those years. Using the arithmetic mean, the average growth is about 24.03 % ((20 % + 33.33 % + 18.75 %)/3). However, if we start with 50 apples and let the yield grow by 24.03 % for three years, the result is about 95.4 apples, not 95. The geometric mean of the growth factors, \( (1.2 \cdot 1.3333 \cdot 1.1875)^{1/3} \approx 1.2386 , \) reproduces the final yield of 95 exactly.
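A one-line check of this in R, using the growth factors implied by the yields above:

g <- c(60/50, 80/60, 95/80)   # yearly growth factors
50 * mean(g)^3                # compounding the arithmetic-mean growth: about 95.4
50 * (prod(g)^(1/3))^3        # compounding the geometric-mean growth: exactly 95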

Example: Calculate the arithmetic, harmonic and geometric mean of the first 10 Fibonacci numbers, \( F_{n+2} = F_{n+1} + F_n , \quad F_0 =0, \ F_1 =1 . \)
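A sketch of this calculation in R, taking the first ten terms starting from \( F_1 = 1 \) (including the zero term \( F_0 \) would make the harmonic and geometric means degenerate):

fib <- c(1, 1)
for (i in 3:10) fib[i] <- fib[i - 1] + fib[i - 2]
fib                            # 1 1 2 3 5 8 13 21 34 55
mean(fib)                      # arithmetic mean: 14.3
length(fib) / sum(1 / fib)     # harmonic mean
prod(fib)^(1 / length(fib))    # geometric mean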

The quantile \( x_p \) is a value of the variable such that 100p % of the values of the ordered sample (or population) are smaller than or equal to \( x_p \) and 100(1−p) % of the values are larger than or equal to \( x_p \).
The quantile is not uniquely defined by this requirement.

There are three possible methods of calculating quantiles.

  1. Sort the data in ascending order. Find the sequential index \( i_p \) of the quantile \( x_p \) that satisfies the inequalities
    \[ n\, p < i_p < n\,p + 1 . \]
    The quantile \( x_p \) is then equal to the value of the variable with the sequential index \( i_p , \) that is, \( x_p = \langle x_{i_p} \rangle . \) If \( np \) is an integer, no index lies strictly between \( np \) and \( np+1 , \) and we calculate the quantile as the arithmetic mean of \( \langle x_{np} \rangle \) and \( \langle x_{np+1} \rangle : \)
    \[ x_p = \frac{1}{2} \left( \langle x_{np} \rangle + \langle x_{np+1} \rangle \right) . \]
  2. According to MATLAB, we calculate
    \[ \overline{i_p} = \frac{np + (np+1)}{2} = \frac{2np+1}{2} , \]
    which determines the location of the quantile. Using linear interpolation we get
    \[ x_p = \langle x_{\lfloor \overline{i_p} \rfloor} \rangle + \left( \langle x_{\lfloor \overline{i_p} \rfloor + 1} \rangle - \langle x_{\lfloor \overline{i_p} \rfloor} \rangle \right) \left( \overline{i_p} - \lfloor \overline{i_p} \rfloor \right) , \]
    where \( \lfloor \cdot \rfloor \) denotes the integer part of the number (the floor). If \( \overline{i_p} < 1 , \) then \( x_p = \langle x_{1} \rangle ; \) if \( \overline{i_p} > n, \) then \( x_p = \langle x_{n} \rangle . \)
  3. According to EXCEL, we assign the values
    \[ 0, \ \frac{1}{n-1} , \ \frac{2}{n-1} , \ \ldots , \ \frac{n-2}{n-1} , \ 1 \]
    to the data sorted in ascending order. If p is a multiple of \( \frac{1}{n-1} , \) the quantile \( x_p \) is equal to the value corresponding to that multiple. If p is not a multiple of \( \frac{1}{n-1} , \) linear interpolation is used.
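In R, the quantile() function implements several such conventions through its type argument; as a sketch, type = 2 should reproduce method 1 above, while the default type = 7 corresponds to the Excel-style method 3:

x <- c(2, 5, 7, 10, 12, 13, 18, 21)   # the sample used in the examples below
quantile(x, probs = c(.10, .25, .50, .75, .90), type = 2)  # method 1
quantile(x, probs = c(.10, .25, .50, .75, .90))            # default type = 7 (Excel-style)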

The n-th percentile of an observation variable is the value that cuts off the first n percent of the data values when it is sorted in ascending order.

The median of an observation variable is the value at the middle when the data is sorted in ascending order. It is an ordinal measure of the central location of the data values.

We apply the median function to compute the median value of eruptions.
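(A sketch, assuming the built-in faithful data set, whose eruptions column records eruption durations of the Old Faithful geyser.)

median(faithful$eruptions)   # median eruption duration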

The mode \( \hat{X} \) is the value of the variable with the highest frequency. In the case of a continuous variable (grouped data), the mode is the value where the histogram reaches its peak.
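Base R has no built-in mode function for data, so here is a small sketch using a frequency table (made-up sample):

x <- c(2, 3, 3, 5, 3, 7, 2)
tab <- table(x)                          # frequency of each distinct value
as.numeric(names(tab)[tab == max(tab)])  # value(s) with the highest frequency: 3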



Means, quantiles, and the mode – measures of location – describe one property of a frequency distribution, namely its location. Another important property is dispersion (variation), which we describe by several measures of variation.

The range of variation R is defined as the difference between the largest and the smallest value of the variable

\[ R = X_{\max} - X_{\min} . \]
It is the simplest but also the crudest measure of variation. It indicates the width of the interval that contains all of the values.

The interquartile range:

\[ R_Q = X_{0.75} - X_{0.25} . \]

The interdecile range:

\[ R_D = X_{0.90} - X_{0.10} . \]

The interpercentile range:

\[ R_C = X_{0.99} - X_{0.01} . \]

The interquartile range indicates the width of the interval that contains the middle 50 % of the ordered sample. By analogy, the interdecile and interpercentile ranges indicate the width of the intervals that contain the middle 80 % and 98 % of the ordered sample, respectively.

We have calculated the quantiles of the data 2, 5, 7, 10, 12, 13, 18, and 21, obtaining the following values:
\( X_{0.10} = 2, \quad X_{0.25} = 6, \quad X_{0.50} = 11, \quad X_{0.75} = 15.5, \quad X_{0.90} = 21 . \)

The range of variation is \( R= X_{\max} - X_{\min} = 21 - 2 =19 . \)
The interquartile range is \( R_Q = X_{0.75} - X_{0.25} = 15.5 - 6 =9.5 . \)
The interdecile range is \( R_D = X_{0.90} - X_{0.10} = 21 - 2 =19 . \)
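These values can be checked in R (quantile type = 2 matches the convention used for the quantiles above):

x <- c(2, 5, 7, 10, 12, 13, 18, 21)
diff(range(x))                                           # range of variation: 19
quantile(x, .75, type = 2) - quantile(x, .25, type = 2)  # interquartile range: 9.5
quantile(x, .90, type = 2) - quantile(x, .10, type = 2)  # interdecile range: 19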

The quartile deviation is defined by the formula

\[ Q = R_Q /2 . \]
The decile deviation is defined by the following formula:
\[ D = R_D /8 . \]

The percentile deviation is defined by the formula

\[ C = R_C /98 . \]

Example: Calculate the quartile and the decile deviation of 2, 5, 7, 10, 12, 13, 18 and 21. The quartile deviation is
\( Q= R_Q /2 = 9.5/2 =4.75 . \)
The decile deviation is \( D= R_D /8 = 19/8 =2.375 . \)
This means that the average width of the two middle quartile intervals is 4.75, and the average width of the eight middle decile intervals is 2.375.

The average deviation is defined as the arithmetic mean of the absolute deviations

\[ d_{\overline{X}} = \frac{1}{n} \, \sum_{i=1}^n \left\vert X_i - \overline{X} \right\vert . \]

Example: Find the average deviation of the data set 1, 2, 5, 6, 7, 8, 8, and 9. Since the arithmetic mean is \( \overline{X} = 5.75 , \) we obtain

\begin{eqnarray*} d_{\overline{X}} &=& \frac{1}{8} \left[ |1 - 5.75| + |2- 5.75|+|5 - 5.75|+ |6 - 5.75| \right] + \\ && \frac{1}{8} \left[ |7 - 5.75| + |8-5.75| + |8-5.75| + |9-5.75| \right] = 2.3125 . \end{eqnarray*}
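The same computation in R:

x <- c(1, 2, 5, 6, 7, 8, 8, 9)
mean(abs(x - mean(x)))   # average deviation: 2.3125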

Subsection: Variance

The variance \( s_n^2 \) is defined as the arithmetic mean of the squares of the deviations

\[ s_n^2 = \frac{1}{n} \, \sum_{i=1}^n \left( X_i - \overline{X} \right)^2 . \]
Expanding the sum above, we get
\begin{eqnarray*} s_n^2 &=& \frac{1}{n} \left( \sum_{i=1}^n X_i^2 - 2\,\overline{X} \,\sum_{i=1}^n X_i + \sum_{i=1}^n \overline{X}^2 \right) \\ &=& \frac{1}{n} \left( \sum_{i=1}^n X_i^2 - 2\,n\,\overline{X}^2 + n\,\overline{X}^2 \right) \\ &=& \frac{1}{n} \, \sum_{i=1}^n X_i^2 - \overline{X}^2 = \overline{X^2} -\overline{X}^2 . \end{eqnarray*}

Elementary properties of the variance:

  1. if the variable is constant, then the variance is zero;
  2. if we add a constant c to the values of the variable, then the variance does not change:
    \[ \frac{1}{n} \, \sum_{i=1}^n \left[ \left( X_i + c \right) - \left( \overline{X} + c \right) \right]^2 = s_n^2 ; \]
  3. if we multiply the values of the variable by a constant c, then the variance is multiplied by \( c^2 : \)
    \[ \frac{1}{n} \, \sum_{i=1}^n \left( c \cdot X_i - c \cdot \overline{X} \right)^2 = c^2 \cdot s_n^2 . \]

The square root of the variance is called the standard deviation

\[ s_n = \sqrt{s_n^2} . \]

The sample variance \( s^2 \) is defined by the formula

\[ s^2 = \frac{1}{n-1} \, \sum_{i=1}^n \left( X_i - \overline{X} \right)^2 . \]
The square root of the sample variance is called the sample standard deviation
\[ s = \sqrt{s^2} . \]
It is obvious that
\[ s_n^2 = \frac{n-1}{n} \, s^2 . \]

Example: Calculate the variance, the standard deviation, the sample variance and the sample standard deviation of the data set 1, 2, 5, 6, 7, 8, 8 and 9.

The arithmetic mean is \( \overline{X} = 5.75 . \) So we have

\begin{eqnarray*} s_n^2 &=& \frac{1}{8} \left[ |1 - 5.75|^2 + |2- 5.75|^2 +|5 - 5.75|^2 + |6 - 5.75|^2 \right] + \\ && \frac{1}{8} \left[ |7 - 5.75|^2 + |8-5.75|^2 + |8-5.75|^2 + |9-5.75|^2 \right] = 7.4375 . \end{eqnarray*}
The variance can also be calculated by the formula \( s_n^2 = \overline{X^2} - \overline{X}^2 : \)
\begin{eqnarray*} \overline{X^2} &=& \frac{1}{n}\, \sum_{i=1}^n X_i^2 = \frac{1}{8} \left[ 1^2 + 2^2 + 5^2 + 6^2 + 7^2 + 8^2 + 8^2 + 9^2 \right] = 40.5 , \\ s_n^2 &=& \overline{X^2} - \overline{X}^2 = 40.5 - 5.75^2 = 7.4375 . \end{eqnarray*}
The standard deviation is
\[ s_n = \sqrt{s_n^2} = \sqrt{7.4375} \approx 2.72718 . \]
To get the sample variance we apply the formula
\[ s^2 = \frac{n}{n-1}\, s_n^2 = \frac{8}{7}\cdot 7.4375 = 8.5 . \]
The sample standard deviation is
\[ s = \sqrt{s^2} = \sqrt{8.5} \approx 2.91548 . \]
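A quick check in R; note that the built-in var() and sd() return the sample versions, with the n - 1 divisor:

x <- c(1, 2, 5, 6, 7, 8, 8, 9)
mean((x - mean(x))^2)         # variance s_n^2 = 7.4375
sqrt(mean((x - mean(x))^2))   # standard deviation s_n
var(x)                        # sample variance s^2 = 8.5
sd(x)                         # sample standard deviation s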


Vitali Milman.

Often we want to show that some random quantity is close to its mean with high probability. Results of this kind are known as concentration of measure, a topic introduced in the early 1970s by the Russian mathematician Vitali Milman (born in 1939). In this section, we consider some simple concentration results and refer the reader to other sources for important inequalities such as Hoeffding’s inequality, Bernstein’s inequality, and McDiarmid’s inequality.

We start with some basic definitions. Let \( X_1 , X_2 , \ldots , X_n \) be a sample of observed values of a variable X. The r-th (raw) sample moment is defined by the formula

\[ m'_r = \frac{1}{n}\, \sum_{i=1}^n X_i^r . \]
The r-th central sample moment is defined by the formula
\[ m_r = \frac{1}{n}\, \sum_{i=1}^n \left( X_i - \overline{X} \right)^r , \]
where \( \overline{X} = m'_1 \) is the mean of the n values of X.

Moments can be calculated in R directly from these definitions, using elementary vector operations.

Another option is to use the function moment from the e1071 package. As it is not in the core R library, the package has to be installed and loaded into the R workspace.
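A sketch of both routes, with a made-up sample x (the e1071 calls assume that package has been installed):

x <- c(1, 2, 5, 6, 7, 8, 8, 9)
r <- 3
mean(x^r)                            # r-th raw moment
mean((x - mean(x))^r)                # r-th central moment

# install.packages("e1071")          # only needed once
library(e1071)
moment(x, order = r)                 # raw moment
moment(x, order = r, center = TRUE)  # central moment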

The sample skewness is defined by the formula

\[ a_3 = \frac{m_3}{m_2^{3/2}} = \frac{1}{n\,s_n^3} \, \sum_{i=1}^n \left( X_i - \overline{X} \right)^3 . \]
The skewness of a data population is defined by the following formula, where μ2 and μ3 are the second and third central moments.
\[ \gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} . \]
Intuitively, the skewness is a measure of symmetry. As a rule, negative skewness indicates that the mean of the data values is less than the median, and the data distribution is left-skewed. Positive skewness would indicate that the mean of the data values is larger than the median, and the data distribution is right-skewed.

To calculate the skewness coefficient (of eruptions) one needs the function skewness from the e1071 package. As the package is not in the core R library, it has to be installed and loaded into the R workspace.

The kurtosis of a univariate population is defined by the formula \( \gamma_2 = \mu_4 / \mu_2^{2} - 3 , \) where \( \mu_2 \) and \( \mu_4 \) are the second and fourth central moments. Intuitively, the kurtosis describes the tail shape of the data distribution. The normal distribution has zero kurtosis and thus the standard tail shape; it is said to be mesokurtic.
The sample kurtosis is defined by the formula

\[ a_4 = \frac{m_4}{m_2^{2}} - 3 = \frac{1}{n\,s_n^4} \, \sum_{i=1}^n \left( X_i - \overline{X} \right)^4 - 3 . \]
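A sketch using faithful$eruptions (e1071 assumed installed; its skewness() and kurtosis() offer several conventions via a type argument, and type = 1 should match the moment formulas \( a_3 \) and \( a_4 \) above):

library(e1071)
x <- faithful$eruptions
skewness(x, type = 1)          # a_3 = m_3 / m_2^(3/2)
kurtosis(x, type = 1)          # a_4 = m_4 / m_2^2 - 3
# the same quantities straight from the definitions:
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)
m3 / m2^1.5
m4 / m2^2 - 3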

Note that the Excel functions SKEW and KURT calculate the skewness and kurtosis by the formulas

\begin{eqnarray*} a_3^{\ast} &=& \frac{n}{(n-1)(n-2)} \,\sum_{i=1}^n \left( \frac{X_i - \overline{X}}{s} \right)^3 , \\ a_4^{\ast} &=& \frac{n(n+1)}{(n-1)(n-2)(n-3)} \,\sum_{i=1}^n \left( \frac{X_i - \overline{X}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)} . \end{eqnarray*}
We can relate them to ours:
\begin{eqnarray*} a_3 &=& \frac{n-2}{\sqrt{n(n-1)}} \, a_3^{\ast} , \\ a_4 &=& \frac{(n-2)(n-3)}{n^2 -1} \,a_4^{\ast} - \frac{6}{n+1} . \end{eqnarray*}
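A small numerical check of these conversions (made-up data; the SKEW and KURT values are reproduced from their formulas rather than taken from Excel itself):

x <- c(1, 2, 5, 6, 7, 8, 8, 9); n <- length(x)
z <- (x - mean(x)) / sd(x)                       # deviations scaled by the sample sd
skew.xl <- n / ((n - 1) * (n - 2)) * sum(z^3)    # Excel SKEW formula
kurt.xl <- n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(z^4) -
           3 * (n - 1)^2 / ((n - 2) * (n - 3))   # Excel KURT formula
m2 <- mean((x - mean(x))^2); m3 <- mean((x - mean(x))^3); m4 <- mean((x - mean(x))^4)
c(a3 = m3 / m2^1.5, from.skew.xl = (n - 2) / sqrt(n * (n - 1)) * skew.xl)
c(a4 = m4 / m2^2 - 3, from.kurt.xl = (n - 2) * (n - 3) / (n^2 - 1) * kurt.xl - 6 / (n + 1))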