An R TUTORIAL for Statistics Applications. Part 1: Descriptive Statistics

Part 1: Descriptive Statistics

This chapter covers the basic methods that R provides for organizing and graphing data.

Section 1: Loading Data into R

To import data into R, you first need to have data. The data can be saved in a file on your computer in Excel, SPSS, or some other format. When your data is saved locally, you can go back to it later to edit it, add more data, or change values, while preserving any formulas that you may have used to calculate the data.

However, data can also be found on the Internet or obtained from other sources. One collection of links to data sets is
https://www.kdnuggets.com/datasets/index.html
and DataCamp’s interactive tutorial https://www.datacamp.com/community/open-courses/quandl-r-tutorial deals with how to import and manipulate Quandl data sets. Many add-on packages make importing and processing data easier.

Here is how you install them (make sure you have an internet connection):

mylibraries = c(
  "data.table",  # fast and compact data processing
  "stringr",     # work with character strings
  "ggplot2",     # comprehensive plotting environment
  "hexbin",      # binning and plotting functions for hexagonal bins
  "zoo",         # work with time series
  "lubridate",   # work with dates
  "DescTools",   # descriptive analytics
  "readxl"       # read Excel files
)


# install.packages(mylibraries, dependencies = TRUE)

To run this command, remove the leading “#” (comment) sign. Alternatively, you can install packages in RStudio via a menu: /Tools/Install packages/ (…type the names of the packages…).

Now that the packages are installed, you have to load them before you can work with them. Here is how you do that:

library(data.table)
library(stringr)
library(ggplot2)
library(zoo)
library(lubridate)
library(DescTools)

R comes with some datasets. We are going to use mtcars. In the script window (upper left), type the following:

data(mtcars)         # loads mtcars dataset
View(mtcars)         # view the data in a separate window

dim(mtcars)          # dimensions: rows x columns

str(mtcars)          # structure: name and type of variable

names(mtcars)        # column (variable) names of the dataset

head(mtcars)         # first few records of the dataset

summary(mtcars)      # statistical summary

?mtcars              # help about a dataset or command

getwd()              # which directory are you in?

dir()                # what are the files in that directory?

ls()                 # list the loaded objects in your current R session

#                    # comment: everything after this sign is ignored by R

1:5                  # prints the numbers 1 to 5

sum(1:5)             # sums the numbers 1 to 5

c(1,2,3)             # c = "concatenate" = make a vector with the numbers 1, 2 and 3

x = c(1,2,3)         # assign a name

You can run these commands by clicking “Source” in the upper right corner of the script window. It is also possible to run the commands one by one: put your cursor anywhere on the line you want to run and press Ctrl+Enter. Note that the View window is very powerful: you can open it in its own window and filter, search and sort the data.

You might have an environment that is still filled with data and values, which you can all delete using the following line of code:
rm(list=ls())

The rm() function allows you to “remove objects from a specified environment”. In this case, you pass it a list of names, which is the outcome of the ls() function. This last function returns a vector of character strings that gives the names of the objects in the specified environment. Since ls() is called here without arguments, it is assumed that you mean the data sets and functions that you as a user have defined.

Next, you might also find it handy to know where your working directory is set at the moment:
getwd()

And you might consider changing the path that you get as a result of this function, maybe to the folder in which you have stored your data set:
setwd("")

Read TXT files with read.table()

If you have a .txt or a tab-delimited text file, you can easily import it with the basic R function read.table(). For example, it can be imported as follows:
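A minimal sketch of such an import, assuming a small tab-delimited file with a header row. The file contents and the names tmp and df are made up for illustration; a temporary file is used so that the example runs anywhere.

```r
# Write a small tab-delimited file to a temporary location (illustrative data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("name\tage\tscore",
             "Ann\t23\t4.5",
             "Bob\t31\t3.8"), tmp)

# header = TRUE: the first line holds the column names.
# sep = "\t":    the fields are separated by tabs.
df <- read.table(tmp, header = TRUE, sep = "\t")

class(df)  # "data.frame"
str(df)    # one row per data line, one column per field
```

With your own data, you would replace tmp by the file name (or a URL) of your data set.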

Note that ideally, you should just pass in the file name and the extension because you have set your working directory to the folder in which your data set is located. The first argument isn’t always a filename; it can also be a URL that points to data. The header argument specifies whether or not the first line of your data file contains column names. Lastly, by using this function, the data from the file will become a data.frame object. The read.table() function is the most important and commonly used function to import simple data files into R. It is easy and flexible.

Read CSV Files into R

If you have a file that separates the values with a , or ;, you usually are dealing with a .csv file. To successfully load this file into R, you can use the read.table() function in which you specify the separator character, or you can use the read.csv() or read.csv2() functions. The former function is used if the separator is a ,, the latter if ; is used to separate the values in your data file.

Remember that the read.csv() and read.csv2() functions are almost identical to the read.table() function. They differ mainly in their defaults: the header and fill arguments are set to TRUE, and the separator and decimal mark are preset (sep = "," and dec = "." for read.csv(); sep = ";" and dec = "," for read.csv2()).

In case you have a file with a separator character that is different from a comma or a semicolon, you can use the read.delim() and read.delim2() functions. These are variants of the read.table() function, just like the read.csv() function: they expect a tab as separator by default, and any other separator can be passed through the sep argument. You can use the read.delim() and read.delim2() functions as follows:
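The following sketch compares the three families of functions on made-up data. The file contents and the object names a, b and d are assumptions chosen for illustration, and the pipe character stands in for any unusual separator.

```r
# Three temporary files with different separators (illustrative data).
csv_comma <- tempfile(fileext = ".csv")
csv_semi  <- tempfile(fileext = ".csv")
txt_pipe  <- tempfile(fileext = ".txt")
writeLines(c("id,value", "1,2.5", "2,3.5"), csv_comma)
writeLines(c("id;value", "1;2,5", "2;3,5"), csv_semi)  # decimal comma
writeLines(c("id|value", "1|2.5", "2|3.5"), txt_pipe)

a <- read.csv(csv_comma)              # sep = ",", dec = "."
b <- read.csv2(csv_semi)              # sep = ";", dec = ","
d <- read.delim(txt_pipe, sep = "|")  # tab by default; sep overrides it

all.equal(a, b)  # the three imports hold the same data
all.equal(a, d)
```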

XLConnect Package for Reading Excel Files

To load Excel files into R, you first need to do some further prepping of your workspace in the sense that you need to install packages. Simply run the following piece of code to accomplish this:
install.packages("")

The first way to get Excel files directly into R is by using the XLConnect package. Install the package and if you’re not sure whether or not you already have it, check if it is already there. Next, you can start using the readWorksheetFromFile() function, just like shown here below:
library(XLConnect)
df <- readWorksheetFromFile("", sheet = 1)

Note that you need to add the sheet argument to specify which sheet you want to load into R. You can also add more specifications. You can also load in a whole workbook with the loadWorkbook() function, to then read in worksheets that you desire to appear as data frames in R through readWorksheet():
wb <- loadWorkbook("")
df <- readWorksheet(wb, sheet=1)

The readxl package has only recently been published and allows R users to easily read in Excel files, just like this:
library(readxl)
df <- read_excel("")

Note that the first argument specifies the path to your .xls or .xlsx file, which you can set by using the getwd() and setwd() functions. You can also add a sheet argument, just like with the XLConnect package, and many more arguments.

Read JSON Files Into R

To get JSON files into R, you first need to install or load the rjson package. If you want to know how to install packages or how to check if packages are already installed, scroll a bit up to the section of importing Excel files into R :) Once you have done this, you can use the fromJSON() function. Here, you have two options:

Your JSON file is stored in your working directory:
# Activate `rjson`
library(rjson)
# Import data from json file
JsonData <- fromJSON(file= "" )
Your JSON file is available through a URL:
# Activate `rjson`
library(rjson)
# Import data from a URL
JsonData <- fromJSON(file= "" )

Read XML Data Into R

If you want to get XML data into R, one of the easiest ways is through the usage of the XML package. First, you make sure you install and load the XML package in your workspace, just like demonstrated above. Then, you can use the xmlTreeParse() function to parse the XML file directly from the web:
# Activate the `XML` library
library(XML)
# Parse the XML file
xmlfile <- xmlTreeParse("")

Next, you can check whether R knows that xmlfile is in XML by entering:
class(xmlfile)
# Result is usually similar to this: [1] "XMLDocument" "XMLAbstractDocument"

You can use the xmlRoot() function to access the top node:
topxml <- xmlRoot(xmlfile)

You will see that the data is presented kind of weirdly when you try printing out the xmlfile vector. That is because the XML file is still a real XML document in R at this point. To put the data in a data frame, you first need to extract the XML values. You can use the xmlSApply() function to do this, for example with topxml <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue)).

The first argument of this function will be topxml, since it is the top node on whose children you want to perform a certain function. Then, you list the function that you want to apply to each child node. In this case, you want to extract the contents of a leaf XML node. This, in combination with the first argument topxml, will make sure that you will do this for each leaf XML node.

Lastly, you put the values in a data frame. You use the data.frame() function in combination with the matrix transposition function t() to do this. Additionally, you specify that no row names should be included:
xml_df <- data.frame(t(topxml), row.names=NULL)

You can also choose not to do all the previous steps, which are a bit more complicated, and to just do the following:
url <- "a URL with XML data"
data_df <- xmlToDataFrame(url)

Importing Data From HTML Tables Into R

Getting data from HTML tables into R is pretty straightforward:
# Assign your URL to `url`
url <- "a URL"
# Read the HTML table
data_df <- readHTMLTable(url, which=3)

Note that the which argument allows you to specify which tables to return from within the document. If this gives you an error along the lines of “failed to load external entity”, don’t be confused: this error has been reported by many people. You can work around it by using the RCurl package in combination with the XML package to read in your data:
# Activate the libraries
library(XML)
library(RCurl)
# Assign your URL to `url`
url <- "YourURL"
# Get the data
urldata <- getURL(url)
# Read the HTML table
data <- readHTMLTable(urldata, stringsAsFactors = FALSE)

Note that you don’t want the strings to be registered as factors or categorical variables! You can also use the httr package to accomplish exactly the same thing, except that you will want to convert the raw bytes of the URL’s content to characters by using the rawToChar() function:
# Activate `httr`
library(httr)
# Get the URL data
urldata <- GET(url)
# Read the HTML table
data <- readHTMLTable(rawToChar(urldata$content), stringsAsFactors = FALSE)

RStudio includes a data viewer that allows you to look inside data frames and other rectangular data structures. The viewer also includes some simple exploratory data analysis (EDA) features that can help you understand the data as you manipulate it with R. You can invoke the viewer in a console by calling the View function on the data frame you want to look at. For instance, to view the built-in iris dataset, run data(iris) followed by View(iris).

You can also start the viewer by clicking on the table data icon on the right, in the environment pane.

Section 2: Modifying Data

Projects often involve so much data that it is difficult to analyze all of the data at once. We present some methods to manipulate data in order to make the data more manageable.

Subsection: Sorting and Filtering Data

R contains useful features for sorting and filtering data so that one can more easily identify patterns.

For example, suppose we would like to sort a data frame by column z (descending) and then by column b (ascending).

You can use the order() function directly without resorting to add-on tools, using a trick right from the top of the example(order) code:
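A minimal sketch of this approach. The data frame dd below is a made-up example with the columns b and z mentioned above; its values are assumptions chosen so that the sort has ties to break.

```r
# Illustrative data frame; b is an ordered factor, z is numeric.
dd <- data.frame(
  b = factor(c("Hi", "Med", "Hi", "Low"),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = c("A", "D", "A", "C"),
  y = c(8, 3, 9, 9),
  z = c(1, 1, 1, 2)
)

# Negating a numeric column reverses its sort direction:
# z descending first, then b ascending within equal z.
sorted <- dd[with(dd, order(-z, b)), ]
sorted
```

The row with z = 2 comes first, followed by the rows with z = 1 in the order Med, Hi, Hi.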

The same approach works for a matrix, except that you can't use with(). Try M <- matrix(c(1,2,2,2,3,6,4,5), 4, 2, byrow=FALSE, dimnames=list(NULL, c("a","b"))) to create a matrix M, then use M[order(M[,"a"],-M[,"b"]),] to order it on two columns.
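Written out as a script, the matrix case looks as follows. The name M_sorted is introduced here only to hold the result.

```r
# Matrix with columns a and b, filled column by column.
M <- matrix(c(1, 2, 2, 2, 3, 6, 4, 5), 4, 2, byrow = FALSE,
            dimnames = list(NULL, c("a", "b")))

# Columns are addressed by name inside order(): a ascending, b descending.
M_sorted <- M[order(M[, "a"], -M[, "b"]), ]
M_sorted
```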

To begin understanding how to properly sort data frames in R, we of course must first generate a data frame to manipulate.
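A small data frame for this purpose could look like the sketch below. The column names x, y, z and their values are assumptions made for illustration.

```r
# Three vectors of equal length form the columns of the data frame.
dataframe <- data.frame(
  x = c("apple", "orange", "banana", "strawberry"),
  y = c("a",     "d",      "b",      "c"),
  z = c(4:1)
)

print(dataframe)
```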

Note: The spacing isn’t necessary, but it improves legibility.

Executing our run.R script outputs the list of vectors in our data frame as expected, in the order they were entered.

The Order Function

While perhaps not the easiest sorting method to type out in terms of syntax, the order() function is the one most readily available in all installations of R, because it is part of the base package.

The order function accepts a number of arguments, but at the simplest level the first argument must be a sequence of values or logical vectors.


Sorting a Data Frame by Vector Name

In the RStudio data viewer, as you might expect, you can sort by any column just by clicking on the column header. Click on a column that’s already sorted to reverse the sort direction.

For example, we can use order() to simply sort a vector of five randomly ordered numbers with this script:
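A minimal version of such a script, with five arbitrarily chosen numbers:

```r
y <- c(4, 12, 6, 7, 2)
print(y)            # the vector in its original order

# order() returns the positions that would sort the vector,
# so indexing y with them yields the sorted values.
idx <- order(y)
print(idx)          # 5 1 3 4 2
print(y[idx])       # 2 4 6 7 12
```

Note that order() does not return the sorted values themselves but an index vector, which is what makes it suitable for rearranging the rows of a data frame.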

Executing the script, we see the initial output of the unordered vector, followed by the now ordered list afterward.

With the order() function in our tool belt, we’ll start sorting our data frame by passing in the vector names within the data frame.

For example, using our previously generated dataframe object, we can sort by the vector z by adding the following code to our script:
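A sketch of that step is given below. The data frame is recreated here so that the example runs on its own; its contents are illustrative assumptions.

```r
dataframe <- data.frame(
  x = c("apple", "orange", "banana", "strawberry"),
  y = c("a", "d", "b", "c"),
  z = c(4:1)
)
print(dataframe)

# with() evaluates order(z) inside the data frame, so z refers to the column.
sorted_by_z <- dataframe[with(dataframe, order(z)), ]
print(sorted_by_z)
```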

What we’re effectively doing is calling our original dataframe object, and passing in the new index order that we’d like to have. This index order is generated using the with() function, which effectively creates a new environment using the passed in data in the first argument along with an expression for evaluating that data in the second argument.

Thus, we’re reevaluating the dataframe data using the order() function, and we want to order based on the z vector within that data frame. This returns a new index order for the data frame values, which is then finally evaluated within the [brackets] of dataframe[], outputting our new ordered result.

Consequently, we see our original unordered output, followed by a second output with the data sorted by column z.

Sorting by Column Index

Similar to the above method, it’s also possible to sort based on the numeric index of a column in the data frame, rather than the specific name.

Instead of using the with() function, we can simply pass the result of order() to our dataframe. We indicate that we want to sort by the column of index 1 by using the dataframe[,1] syntax, which causes R to return the values of that first column. In other words, similar to when we passed in the z vector name above, order() is sorting based on the vector values that are within the column of index 1:
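A sketch of sorting by column index, again on an illustrative data frame that is recreated here so the example is self-contained:

```r
dataframe <- data.frame(
  x = c("apple", "orange", "banana", "strawberry"),
  y = c("a", "d", "b", "c"),
  z = c(4:1)
)
print(dataframe)

# dataframe[, 1] extracts the first column; order() sorts by its values.
sorted_by_first <- dataframe[order(dataframe[, 1]), ]
print(sorted_by_first)
```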

As expected, we get our normal output followed by the sorted output in the first column:

Sorting by Multiple Columns

In some cases, it may be desired to sort by multiple columns. Thankfully, doing so is very simple with the previously described methods.

To sort multiple columns using vector names, simply add additional arguments to the order() function call as before:

Similarly, to sort by multiple columns based on column index, add additional arguments to order() with differing indices:
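Both variants can be sketched on a data frame in which the first sort key has ties. The values below are assumptions chosen so that the second key actually matters.

```r
dataframe <- data.frame(
  x = c("apple", "orange", "banana", "strawberry"),
  y = c("a", "d", "b", "c"),
  z = c(2, 1, 2, 1)
)

# By name: z first, then x within equal values of z.
by_name <- dataframe[with(dataframe, order(z, x)), ]

# By index: column 3 first, then column 1.
by_index <- dataframe[order(dataframe[, 3], dataframe[, 1]), ]

print(by_name)
identical(by_name, by_index)  # both calls give the same ordering
```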

How to sort in decreasing order

Just like sort(), the order() function also takes an argument called decreasing. For example, to sort some.states in decreasing order of population:
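The sketch below assumes that some.states is a small data frame built from the datasets state.region and state.x77 that ship with R. This construction is an assumption made for illustration; any data frame with a Population column would work the same way.

```r
# First ten states, with region, population and income (built-in data).
some.states <- data.frame(Region = state.region, state.x77)[1:10, 1:3]

# decreasing = TRUE reverses the overall sort direction.
sorted <- some.states[order(some.states$Population, decreasing = TRUE), ]
sorted
```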

Suppose we want to sort a vector, matrix, or data frame.

The viewer also refreshes automatically when the underlying data changes. For example, if you call View() on the built-in Orange dataset and then set the age of the first tree to 120, you’ll see the age of the first tree change from 118 to 120 in the viewer.

This auto-refreshing feature has some prerequisites, so if it doesn’t seem to be working, check the following:
You must call View() on a variable directly. If, for instance, you call View(as.data.frame(foo)) or View(rbind(foo, bar)), you’re invoking View() on a new object created by evaluating your expression, and while that object contains data, it’s just a copy and won’t update when foo and bar do.

The number of rows the viewer can display is effectively unbounded, and large numbers of rows won’t slow down the interface. It uses the DataTables JavaScript library to virtualize scrolling, so only a few hundred rows are actually loaded at a time.

While rows are unbounded, columns are capped at 100. It’s not currently possible to virtualize columns in the same way as rows, and large numbers of columns cause the interface to slow significantly.

Finally, while we’ve made every effort to keep things speedy, very large amounts of data may cause sluggishness, especially when a sort or filter is applied, as this requires R to fully scan the frame. If you’re working with large frames, try applying filters to reduce it to the subset you’re interested in to improve performance.

Data frames

To sort a data frame on one or more columns, you can use the arrange function from the plyr package, or use R’s built-in functions. The arrange function is much easier to use, but does require the external package to be installed.

Note that the size column is a factor and is sorted by the order of the factor levels. In this case, the levels were automatically assigned alphabetically (when creating the data frame), so large is first and small is last.
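The note above can be reproduced with a small data frame such as the one sketched here. The columns id, weight and size and their values are illustrative assumptions, and only built-in functions are used; with plyr loaded, arrange(df, size) would give the same ordering.

```r
df <- data.frame(
  id     = c(1, 2, 3),
  weight = c(20, 27, 24),
  size   = factor(c("small", "large", "medium"))
)

levels(df$size)         # "large" "medium" "small" (alphabetical)

df[order(df$weight), ]  # sort by a numeric column
df[order(df$size), ]    # sort by a factor: follows the level order
```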

Reverse sort

The overall order of the sort can be reversed with the argument decreasing=TRUE.

To reverse the direction of a particular column, the method depends on the data type:

Numbers: put a - in front of the variable name, e.g. df[ order(-df$weight), ].
Factors: convert to integer and put a - in front of the variable name, e.g. df[ order(-xtfrm(df$size)), ].
Characters: there isn’t a simple way to do this. One method is to convert to a factor first and then sort as above.
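The first two cases can be tried on a small made-up data frame. The names and values below are assumptions for illustration.

```r
df <- data.frame(
  id     = c(1, 2, 3),
  weight = c(20, 27, 24),
  size   = factor(c("small", "large", "medium"))
)

# Reverse the whole sort.
rev_all <- df[order(df$weight, decreasing = TRUE), ]

# Reverse a numeric column by negating it.
rev_num <- df[order(-df$weight), ]

# Reverse a factor column: xtfrm() gives the integer codes of the levels.
rev_fac <- df[order(-xtfrm(df$size)), ]

rev_all
rev_fac
```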

http://sites.stat.psu.edu/~drh20/R/html/base/html/sort.html

Conditional Formatting in Data Frames

https://www.rdocumentation.org/packages/condformat/versions/0.7.0
The example is properly formatted at http://zeehio.github.io/condformat.

R-script:

data(iris)
library(condformat)
condformat(iris[c(1:5,70:75, 120:125),]) %>%
  rule_fill_discrete(Species) %>%
  rule_fill_discrete(c(Sepal.Width, Sepal.Length),
                     expression = Sepal.Width > Sepal.Length - 2.25,
                     colours = c("TRUE" = "#7D00FF")) %>%
  rule_fill_gradient2(Petal.Length) %>%
  rule_css(Sepal.Length,
           expression = ifelse(Species == "setosa", "bold", "regular"),
           css_field = "font-weight") %>%
  rule_css(Sepal.Length,
           expression = ifelse(Species == "setosa", "yellow", "black"),
           css_field = "color")

More Examples on Styling Cells, Rows, and Tables:
https://rstudio.github.io/DT/010-style.html

Filtering data:
https://stackoverflow.com/questions/1686569/filter-data-frame-rows-by-a-logical-condition
http://www.rexamples.com/11/Filtering%20data
https://www.rdocumentation.org/packages/dplyr/versions/0.7.3/topics/filter
https://blog.exploratory.io/filter-data-with-dplyr-76cf5f1a258e
https://stats.stackexchange.com/questions/6187/filtering-a-dataframe

Section 3: Creating Distributions

Making a frequency distribution table at first sight looks more complicated in R than in other programs. The reason for this, however, is that you use the same commands for the simplest table as for more complicated ones. So, the commands are slightly harder to start with, but they remain the same for all types of tables, including grouped frequency distribution tables.

First, you have to define the spans you are interested in. Suppose we are interested in all spans from 3 to 11. This is defined by the command:

span = seq(3,11,by=1)

This command defines the lower values of the intervals as the sequence of values from 3 to 11 with a step of 1 (i.e., it will go from 3 to 4, to 5, etc). If we define by=2, then the lower values of the spans will go from 3 to 5, to 7, and so on, which means that we will have a grouped frequency distribution table.

Next we segment the continuum of values in Data_STM according to the spans we defined above. We do this with the following command:

span.cut = cut(Data_STM$STM.span, span, right=FALSE)

This command cuts the values of the variable STM.span from the dataset Data_STM into the intervals defined above and stores the result in span.cut. The following parts are important:

Data_STM$STM.span

This is the variable STM.span from the dataset Data_STM. Notice that we must use uppercase where it has been used in the names. Also notice that R has rewritten our original name STM_span into STM.span (as you can see when you list the variable, or when you look at Data_STM in the upper left panel). Variables from a dataset are referred to as dataset$variable (i.e., with a $ sign between the name of the dataset and the name of the variable).

span refers to the lower values of the intervals we defined above.

right=FALSE ensures that the lower value of the next interval is excluded from the present interval. If we had not done this, the interval 3 ‐ 4 would exclude the number 3 and include the number 4. Now the interval goes from 3.0 to 3.99999....

To calculate the frequencies of the intervals (STM ‐ spans) we use the command:

span.freq = table(span.cut)
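The steps of this section can be sketched end to end as follows. The dataset Data_STM is not available here, so a small made-up stand-in with a single column STM.span is created first; its values and the name freq.table are assumptions for illustration.

```r
# Hypothetical stand-in for the real dataset.
Data_STM <- data.frame(STM.span = c(4, 5, 5, 6, 6, 6, 7, 7, 8, 9))

span      <- seq(3, 11, by = 1)                             # lower interval bounds
span.cut  <- cut(Data_STM$STM.span, span, right = FALSE)    # intervals [3,4), [4,5), ...
span.freq <- table(span.cut)                                # frequencies per interval

# Frequencies and relative frequencies side by side.
freq.table <- data.frame(
  interval  = names(span.freq),
  frequency = as.vector(span.freq),
  relative  = as.vector(prop.table(span.freq))
)
freq.table
```

Replacing by = 1 with by = 2 in seq() gives the grouped table with intervals of width 2.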

Frequency tables should be calculated for categorical variables only (or for quantitative variables with a very small number of unique values), while for quantitative variables one would use descriptive statistics like quartiles, the mean, etc., all of which are reported by the function summary(). If you need to generate a summary table with frequencies for all variables/columns of a data frame df in RStudio, type the following commands:

lapply(df, function(x) {
    if (is.numeric(x)) return(summary(x))
    if (is.factor(x)) return(table(x))
})

or you could use the count() function that comes with the plyr package

install.packages("plyr") # if you don't have it already
library(plyr)
ds <- matrix(c(sample(1:3,6,replace=TRUE),sample(4:6,6,replace=TRUE)),ncol=2)
# ds is the generated dataset, your data would go here
apply(ds,2,count)

Section 4: Measures of Location

The most important aspect of studying the distribution of a sample of measurements is locating the position of a central value about which the measurements are distributed. The arithmetic mean (average) of a set of n measurements $ X_1 , X_2 , \ldots , X_n $ is given by the formula

\[ \overline{X} = \frac{1}{n}\, \sum_{i=1}^n X_i = \frac{X_1 + X_2 + \cdots + X_n}{n} . \]

The mean provides a measure of central location for the data. If the data are for a sample (typically the case), the mean is denoted by $ \overline{X} . $ The sample mean is a point estimate of the (typically unknown) population mean for the variable of interest. If the data for the entire population are available, the population mean is computed in the same manner, but denoted either by $ E[X] , $ or by the Greek letter μ.

If the data are organized in a frequency distribution table, then we can calculate the mean by the formula

\[ \overline{X} = \frac{1}{n}\, \sum_{i=1}^k n_i X_i , \qquad n = \sum_{i=1}^k n_i , \]

where $ n_1 , n_2 , \ldots , n_k $ are the frequencies of the distinct values $ X_1 , X_2 , \ldots , X_k $ of the variable.

Elementary properties of the arithmetic mean:

the sum of deviations between the values and the mean is equal to zero:
\[ \sum_{i=1}^n \left( X_i - \overline{X} \right) =0 ; \]
if the variable is constant then the mean is equal to this constant:
\[ \frac{1}{n}\, \sum_{i=1}^n c = c; \]
if we add a constant c to the values of the variable, then
\[ \frac{1}{n}\, \sum_{i=1}^n \left( X_i + c \right) = c + \overline{X} ; \]
if we multiply the values of the variable by a constant c, then

\[ \frac{1}{n}\, \sum_{i=1}^n c\cdot X_i = c \cdot \overline{X} . \]

The harmonic mean of a set of n measurements $ X_1 , X_2 , \ldots , X_n $ is defined by the formula

\[ \overline{X}_H = \frac{n}{\sum_{i=1}^n X_i^{-1}} . \]

In certain situations, especially many situations involving rates and ratios, the harmonic mean provides the truest average.

The geometric mean of a set of n measurements $ X_1 , X_2 , \ldots , X_n $ is defined by the formula

\[ \overline{X}_G = \left( X_1 \cdot X_2 \cdot \cdots \cdot X_n \right)^{1/n} = \sqrt[n]{X_1 \cdot X_2 \cdot \cdots \cdot X_n} . \]

The geometric mean may be more appropriate than the arithmetic mean for describing percentage growth.

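Suppose an apple tree yields 50 apples one year, then 60, 80 and 95 the following years, so the growth is 20 %, 33.33 % and 18.75 % for each of the years. Using the arithmetic mean, we can calculate an average growth of 24.03 % (20 % + 33.33 % + 18.75 % divided by 3). However, if we start with 50 apples and let the yield grow by 24.03 % for three years, the result is about 95.4 apples, not 95. The geometric mean of the growth factors, $ (1.2 \cdot 1.3333 \cdot 1.1875)^{1/3} = 1.9^{1/3} \approx 1.2386 , $ corresponds to an average growth of 23.86 % and reproduces the final yield of 95 apples exactly.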
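The comparison can be checked numerically. The sketch below computes the year-on-year growth factors directly from the yields 50, 60, 80 and 95; the variable names are chosen for illustration.

```r
yield  <- c(50, 60, 80, 95)

# Growth factor of each year relative to the previous one.
growth <- yield[-1] / yield[-length(yield)]   # 1.2, 1.3333, 1.1875

am <- mean(growth)                            # arithmetic mean of the factors
gm <- prod(growth)^(1 / length(growth))       # geometric mean of the factors

50 * am^3   # slightly more than 95: the arithmetic mean overstates growth
50 * gm^3   # 95: the geometric mean reproduces the final yield
```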

Example: Calculate the arithmetic, harmonic and geometric mean of the first 10 Fibonacci numbers, $ F_{n+2} = F_{n+1} + F_n , \quad F_0 =0, \ F_1 =1 . $
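One possible solution is sketched below. It assumes that the first 10 Fibonacci numbers are taken as $ F_1 , \ldots , F_{10} $ (that is, 1, 1, 2, ..., 55), since including $ F_0 = 0 $ would make the harmonic mean undefined and the geometric mean zero.

```r
# Build F_1, ..., F_10 from the recurrence F_{n+2} = F_{n+1} + F_n.
fib <- numeric(10)
fib[1:2] <- 1
for (i in 3:10) fib[i] <- fib[i - 1] + fib[i - 2]

arith <- mean(fib)                    # arithmetic mean
harm  <- length(fib) / sum(1 / fib)   # harmonic mean
geom  <- exp(mean(log(fib)))          # geometric mean, computed via logarithms

c(arithmetic = arith, harmonic = harm, geometric = geom)
```

For positive data the three means always satisfy harmonic ≤ geometric ≤ arithmetic, which gives a quick sanity check of the output.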

The quantile $ x_p $ is the value of the variable such that 100p % of the values of the ordered sample (or population) are smaller than or equal to $ x_p $ and 100(1−p) % of the values are larger than or equal to $ x_p . $
The quantile is not uniquely defined.

There are three possible methods of calculating quantiles. In all of them, $ \langle x_i \rangle $ denotes the i-th value of the sample sorted in ascending order.

1. Sort the data in ascending order. Find the sequential index $ i_p $ of the quantile $ x_p $ that satisfies the inequalities
\[ n\, p < i_p < n\,p +1 . \]
The quantile $ x_p $ is then equal to the value of the variable with the sequential index $ i_p : $ $ x_p = \langle x_{i_p} \rangle . $ If np and np+1 are integers, we calculate the quantile as the arithmetic mean of $ \langle x_{np} \rangle $ and $ \langle x_{np+1} \rangle : $
\[ x_p = \frac{1}{2} \left( \langle x_{np} \rangle + \langle x_{np+1} \rangle \right) . \]
2. According to MATLAB, we calculate
\[ \overline{i_p} = \frac{np+np+1}{2} = \frac{2np+1}{2} , \]
which determines the location of the quantile. Using linear interpolation we get
\[ x_p = \langle x_{\lfloor \overline{i_p} \rfloor} \rangle + \left( \langle x_{\lfloor \overline{i_p} \rfloor +1} \rangle - \langle x_{\lfloor \overline{i_p} \rfloor} \rangle \right) \left( \overline{i_p} - \lfloor \overline{i_p} \rfloor \right) , \]
where $ \lfloor \cdot \rfloor $ denotes the integer part of the number, called the floor. If $ \overline{i_p} < 1 , $ then $ x_p = \langle x_{1} \rangle ; $ if $ \overline{i_p} > n, $ then $ x_p = \langle x_{n} \rangle . $
3. According to Excel, we assign the values
\[ 0, \frac{1}{n-1} , \frac{2}{n-1} , \ldots , \frac{n-2}{n-1} , 1 \]
to the data sorted in ascending order. If p is equal to a multiple of $ \frac{1}{n-1} , $ the quantile $ x_p $ is equal to the value corresponding to the given multiple. If p is not a multiple of $ \frac{1}{n-1} , $ linear interpolation is used.
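R's quantile() function implements several such definitions through its type argument. The sketch below assumes that the three methods above correspond to type = 2 (averaging at integer np), type = 5 (interpolation at np + 1/2) and type = 7 (R's default, positions (i−1)/(n−1)); this mapping is my reading of the formulas, not something stated in the text. The sample x is illustrative.

```r
x <- c(2, 5, 7, 10, 12, 13, 18, 21)
p <- c(0.10, 0.25, 0.50, 0.75, 0.90)

q_avg    <- quantile(x, p, type = 2)  # method 1: averaging when np is an integer
q_matlab <- quantile(x, p, type = 5)  # method 2: interpolation at np + 1/2
q_excel  <- quantile(x, p, type = 7)  # method 3: positions 0, 1/(n-1), ..., 1

# One row per method, one column per probability.
rbind(q_avg, q_matlab, q_excel)
```

The rows differ slightly, which illustrates that the quantile is not uniquely defined.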

The n-th percentile of an observation variable is the value that cuts off the first n percent of the data values when it is sorted in ascending order.

The median of an observation variable is the value at the middle when the data is sorted in ascending order. It is an ordinal measure of the central location of the data values.

We apply the median function to compute the median value of eruptions, the eruption durations in the built-in faithful dataset.
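A minimal sketch using the faithful dataset that ships with R; the variable name duration is chosen for illustration.

```r
duration <- faithful$eruptions   # eruption durations in minutes

med <- median(duration)          # value in the middle of the sorted data
med

# Percentiles are obtained with quantile(); the 50th percentile is the median.
quantile(duration, c(0.25, 0.50, 0.75))
```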

The mode $ \hat{X} $ is the value of variable with the highest frequency. In the case of continuous variable (data) the mode is the value where the histogram reaches its peak.
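Base R has no function for the statistical mode (the built-in mode() reports the storage type of an object). A small helper can be sketched with table(); the name stat_mode is hypothetical.

```r
# Return the most frequent value(s) of x as character strings.
# If several values share the highest frequency, all of them are returned.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

stat_mode(c(3, 5, 5, 7, 7, 7, 9))   # "7"
stat_mode(c("a", "b", "b", "a"))    # "a" "b"
```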

Section 5: Measures of Variability

Means, quantiles and a mode – measures of location – describe one property of frequency distribution – location. Another important property is dispersion (variation) which we describe by several measures of variation.

The range of variation R is defined as difference between the largest and the smallest value of the variable

\[ R = X_{\max} - X_{\min} . \]

It is the simplest but also the crudest measure of variation. It indicates the width of the interval in which all values are included.

The interquartile range:

\[ R_Q = X_{0.75} - X_{0.25} . \]

The interdecile range:

\[ R_D = X_{0.90} - X_{0.10} . \]

The interpercentile range:

\[ R_C = X_{0.99} - X_{0.01} . \]

The interquartile range indicates the width of the interval which includes the middle 50 % of the values of the ordered sample. By analogy, the interdecile or the interpercentile range indicates the width of the interval which includes the middle 80 % or 98 % of the values of the ordered sample.

Suppose we have calculated the quantiles of the data 2, 5, 7, 10, 12, 13, 18 and 21 and obtained the following values:
X_0.10 =2, X_0.25 =6, X_0.50 =11, X_0.75 =15.5, X_0.90 =21.

The range of variation is $ R= X_{\max} - X_{\min} = 21 - 2 =19 . $
The interquartile range is $ R_Q = X_{0.75} - X_{0.25} = 15.5 - 6 =9.5 . $
The interdecile range is $ R_D = X_{0.90} - X_{0.10} = 21 - 2 =19 . $

The quartile deviation is defined by the formula

\[ Q = R_Q /2 . \]

The decile deviation is defined by the following formula:

\[ D = R_D /8 . \]

The percentile deviation is defined by the formula

\[ C = R_C /98 . \]

Example: Calculate the quartile and the decile deviation of 2, 5, 7, 10, 12, 13, 18 and 21. The quartile deviation is
$ Q= R_Q /2 = 9.5/2 =4.75 . $
The decile deviation is $ D= R_D /8 = 19/8 =2.375 . $
It means that the average width of two (eight) middle quartile (decile) intervals is 4.75 (2.375).
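The ranges and deviations of this example can be reproduced in R. The sketch assumes that the quantiles are computed with type = 2 of quantile(), which averages neighbouring values when np is an integer and matches the quantile values listed above.

```r
x <- c(2, 5, 7, 10, 12, 13, 18, 21)

# Quantiles X_0.10, X_0.25, X_0.75, X_0.90 (type = 2 is an assumption, see above).
q <- quantile(x, c(0.10, 0.25, 0.75, 0.90), type = 2, names = FALSE)

R   <- diff(range(x))   # range of variation: 19
R_Q <- q[3] - q[2]      # interquartile range: 9.5
R_D <- q[4] - q[1]      # interdecile range: 19

Q <- R_Q / 2            # quartile deviation: 4.75
D <- R_D / 8            # decile deviation: 2.375

c(R = R, R_Q = R_Q, R_D = R_D, Q = Q, D = D)
```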

The average deviation is defined as the arithmetic mean of the absolute deviations

\[ d_{\overline{X}} = \frac{1}{n} \, \sum_{i=1}^n \left\vert X_i - \overline{X} \right\vert . \]

Find the average deviation of a data set 1, 2, 5, 6, 7, 8, 8 and 9. Since the arithmetic mean is $ \overline{X} = 5.75 , $ we obtain

\begin{eqnarray*} d_{\overline{X}} &=& \frac{1}{8} \left[ |1 - 5.75| + |2- 5.75|+|5 - 5.75|+ |6 - 5.75| \right] + \\ && \frac{1}{8} \left[ |7 - 5.75| + |8-5.75| + |8-5.75| + |9-5.75| \right] = 2.3125 . \end{eqnarray*}
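The same value follows directly from the definition in R. Note that the built-in mad() computes the median absolute deviation, which is a different quantity, so the mean of the absolute deviations is written out explicitly here.

```r
x <- c(1, 2, 5, 6, 7, 8, 8, 9)

x_bar   <- mean(x)                 # 5.75
avg_dev <- mean(abs(x - x_bar))    # 2.3125
avg_dev
```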

Subsection: Variance

The variance s_n² is defined as the arithmetic mean of squares of deviations

\[ s_n^2 = \frac{1}{n} \, \sum_{i=1}^n \left\vert X_i - \overline{X} \right\vert^2 . \]

Expanding the sum above, we get

\begin{eqnarray*} s_n^2 &=& \frac{1}{n} \left( \sum_{i=1}^n X_i^2 - 2\,\overline{X} \,\sum_{i=1}^n X_i + \sum_{i=1}^n \overline{X}^2 \right) \\ &=& \frac{1}{n} \left( \sum_{i=1}^n X_i^2 - 2\,n\,\overline{X}^2 + n\,\overline{X}^2 \right) \\ &=& \frac{1}{n} \, \sum_{i=1}^n X_i^2 - \overline{X}^2 = \overline{X^2} -\overline{X}^2 . \end{eqnarray*}

Elementary properties of the variance:

if the variable is constant, then the variance is zero;
if we add a constant c to the values of the variable, then the variance does not change:
\[ \frac{1}{n} \, \sum_{i=1}^n \left[ \left( X_i + c \right) - \left( \overline{X} + c \right) \right]^2 = s_n^2 ; \]
if we multiply the values of the variable by a constant c, then
\[ \frac{1}{n} \, \sum_{i=1}^n \left( c \cdot X_i - c \cdot \overline{X} \right)^2 = c^2 \cdot s_n^2 . \]

The square root of the variance is called standard deviation

\[ s_n = \sqrt{s_n^2} . \]

The sample variance s² is defined by the formula

\[ s^2 = \frac{1}{n-1} \, \sum_{i=1}^n \left( X_i - \overline{X} \right)^2 . \]

The square root of the sample variance is called sample standard deviation

\[ s = \sqrt{s^2} . \]

It is obvious that

\[ s_n^2 = \frac{n-1}{n} \, s^2 . \]

Example: Calculate the variance, the standard deviation, the sample variance and the sample standard deviation of the data set 1, 2, 5, 6, 7, 8, 8 and 9.

The arithmetic mean is $ \overline{X} = 5.75 . $ So we have

\begin{eqnarray*} s_n^2 &=& \frac{1}{8} \left[ |1 - 5.75|^2 + |2- 5.75|^2 +|5 - 5.75|^2 + |6 - 5.75|^2 \right] + \\ && \frac{1}{8} \left[ |7 - 5.75|^2 + |8-5.75|^2 + |8-5.75|^2 + |9-5.75|^2 \right] = 7.4375 . \end{eqnarray*}

The variance can also be calculated by the formula $ s_n^2 = \overline{X^2} - \overline{X}^2 . $

\begin{eqnarray*} \overline{X^2} &=& \frac{1}{n}\, \sum_{i=1}^n X_i^2 = \frac{1}{8} \left[ 1^2 + 2^2 + 5^2 + 6^2 + 7^2 + 8^2 + 8^2 + 9^2 \right] = 40.5 , \\ s_n^2 &=& \overline{X^2} - \overline{X}^2 = 40.5 - 5.75^2 = 7.4375 . \end{eqnarray*}

The standard deviation is

\[ s_n = \sqrt{s_n^2} = \sqrt{7.4375} \approx 2.72718 . \]

To get the sample variance we apply the formula

\[ s^2 = \frac{n}{n-1}\, s_n^2 = \frac{8}{7}\cdot 7.4375 = 8.5 . \]

The sample standard deviation is

\[ s = \sqrt{s^2} = \sqrt{8.5} \approx 2.91548 . \]
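These results can be verified in R. The built-in functions var() and sd() use the divisor n − 1, so they return the sample variance and the sample standard deviation; the variance with divisor n is obtained by rescaling.

```r
x <- c(1, 2, 5, 6, 7, 8, 8, 9)
n <- length(x)

s2  <- var(x)               # sample variance (divisor n - 1): 8.5
s   <- sd(x)                # sample standard deviation
sn2 <- (n - 1) / n * s2     # variance with divisor n: 7.4375
sn  <- sqrt(sn2)            # standard deviation with divisor n

# The shortcut formula gives the same variance.
mean(x^2) - mean(x)^2       # 7.4375

c(sn2 = sn2, sn = sn, s2 = s2, s = s)
```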

Section 6: Measures of Concentration

Often we want to show that some random quantity is close to its mean with high probability. Results of this kind are known as concentration of measure, a notion first introduced in the early 1970s by the Russian mathematician Vitali Milman (born in 1939). In this section, we consider some simple concentration results and refer the reader to other sources for important inequalities such as Hoeffding’s inequality, Bernstein’s inequality, and McDiarmid’s inequality.

We start with some basic definitions. Let X be a discrete random variable. Its r-th moment is defined by the formula

\[ m'_r = \frac{1}{n}\, \sum_{i=1}^n X_i^r . \]

The r-th central moment is defined by the formula

\[ m_r = \frac{1}{n}\, \sum_{i=1}^n \left( X_i - \overline{X} \right)^r , \]

where $ \overline{X} = m'_1 $ is the mean value of the n values of X.

Moments can be calculated with R as follows:
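A minimal base-R sketch that follows the two definitions above directly. The helper names raw_moment and central_moment are hypothetical, and the sample x is illustrative.

```r
# r-th moment: mean of the r-th powers.
raw_moment <- function(x, r) mean(x^r)

# r-th central moment: mean of the r-th powers of the deviations from the mean.
central_moment <- function(x, r) mean((x - mean(x))^r)

x <- c(1, 2, 5, 6, 7, 8, 8, 9)

raw_moment(x, 1)       # the mean: 5.75
raw_moment(x, 2)       # mean of squares: 40.5
central_moment(x, 1)   # zero (up to rounding error)
central_moment(x, 2)   # variance with divisor n: 7.4375
```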

Another option is to use the function moment from the e1071 package. As it is not in the core R library, the package has to be installed and loaded into the R workspace.

The sample skewness is defined by the formula

\[ a_3 = \frac{m_3}{m_2^{3/2}} = \frac{1}{n\,s_n^3} \, \sum_{i=1}^n \left( X_i - \overline{X} \right)^3 . \]

The skewness of a data population is defined by the following formula, where μ₂ and μ₃ are the second and third central moments.

\[ \gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} . \]

Intuitively, the skewness is a measure of symmetry. As a rule, negative skewness indicates that the mean of the data values is less than the median, and the data distribution is left-skewed. Positive skewness would indicate that the mean of the data values is larger than the median, and the data distribution is right-skewed.

To calculate the skewness coefficient (of eruptions) one needs the function skewness from the e1071 package. As the package is not in the core R library, it has to be installed and loaded into the R workspace.
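If the package is not at hand, the coefficient $ a_3 $ can also be computed from its definition in base R, as sketched below. The helper name sample_skewness is hypothetical, and the result may differ slightly from the value reported by e1071, which by default applies a different normalisation.

```r
# Sample skewness a_3 = m_3 / m_2^(3/2), using central moments with divisor n.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

duration <- faithful$eruptions
sample_skewness(duration)            # negative: the distribution is left-skewed
mean(duration) < median(duration)    # consistent with negative skewness
```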

The kurtosis of a univariate population is defined by the following formula, ... moments. Intuitively, the kurtosis describes the tail shape of the data distribution. The normal distribution has zero kurtosis and thus the standard tail shape. It is said to be mesokurtic. ...
The sample kurtosis is defined by the formula

\[ a_4 = \frac{m_4}{m_2^{2}} -3 = \frac{1}{n\,s_n^4} \, \sum_{i=1}^n \left( X_i - \overline{X} \right)^4 - 3 . \]

Note that the Excel functions SKEW and KURT calculate skewness and kurtosis by the formulas

\begin{eqnarray*} a_3^{\ast} &=& \frac{n}{(n-1)(n-2)} \,\sum_{i=1}^n \left( \frac{X_i - \overline{X}}{s} \right)^3 , \\ a_4^{\ast} &=& \frac{n(n+1)}{(n-1)(n-2)(n-3)} \,\sum_{i=1}^n \left( \frac{X_i - \overline{X}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)} . \end{eqnarray*}

We can relate them to ours:

\begin{eqnarray*} a_3 &=& \frac{n-2}{\sqrt{n(n-1)}} \, a_3^{\ast} , \\ a_4 &=& \frac{(n-2)(n-3)}{n^2 -1} \,a_4^{\ast} - \frac{6}{n+1} . \end{eqnarray*}
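The two relations can be checked numerically. The sketch below assumes Excel's documented definitions, in which s is the sample standard deviation and KURT subtracts the term $ 3(n-1)^2 / ((n-2)(n-3)) ; $ the sample x and the variable names are illustrative.

```r
x <- c(1, 2, 5, 6, 7, 8, 8, 9)
n <- length(x)
d <- x - mean(x)

# Our coefficients, from central moments with divisor n.
m2 <- mean(d^2); m3 <- mean(d^3); m4 <- mean(d^4)
a3 <- m3 / m2^(3 / 2)
a4 <- m4 / m2^2 - 3

# Excel-style coefficients, using the sample standard deviation sd(x).
z <- d / sd(x)
a3_star <- n / ((n - 1) * (n - 2)) * sum(z^3)
a4_star <- n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(z^4) -
  3 * (n - 1)^2 / ((n - 2) * (n - 3))

# Both comparisons should report TRUE.
all.equal(a3, (n - 2) / sqrt(n * (n - 1)) * a3_star)
all.equal(a4, (n - 2) * (n - 3) / (n^2 - 1) * a4_star - 6 / (n + 1))
```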

Section 7: Measures Between Two Variables

Nothing here yet... Check back in a few weeks.