
An R TUTORIAL for Statistics Applications

Part 2: Data Visualization

This chapter covers the basics of data visualization using R.

Email Vladimir Dobrushkin

The first step in trying to interpret data is often to visualize it in some way. Data visualization can be as simple as creating a summary table, or it could require generating charts to help interpret, analyze, and learn from the data. Data visualization is very helpful for identifying data errors and reducing the size of your data set by highlighting important relationships and trends.


R, distributed through the Comprehensive R Archive Network (CRAN), has a well-established plotting engine. Creating plots in R is simple, powerful, and effective for a wide variety of applications. One of the main reasons that data analysts and data scientists turn to R is its strong graphics capabilities. The extensive online documentation allows for quick troubleshooting, and a massive library of packages allows R to be extended to nearly any application.

Effective Design Techniques

We can store the data in R as two separate vectors via the following command:
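The original data are not reproduced on this page, so the following is only a minimal sketch: the values in X and Y are made up for illustration.

```r
# Hypothetical data: two numeric vectors of equal length
X <- c(1, 2, 3, 4, 5, 6, 7, 8)
Y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)
```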

Once you execute the code, there is no output. This is because you did not ask R to display the result. One can either use the View command or plot the data.

Given X and Y from previous exercises, we can plot X vs. Y.

Within this plot, we have a nice range of customizability, from point types, to adding titles and changing the X and Y axis titles, to adding data labels, etc. To add a title, we call the command:
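A minimal sketch of a titled scatter plot follows. The vectors X and Y hold made-up values, and the title and axis labels are placeholders; title() is one way to add a title after the plot has been drawn, and the main argument of plot() is an alternative.

```r
# Hypothetical data (assumed for illustration only)
X <- c(1, 2, 3, 4, 5, 6, 7, 8)
Y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

# Scatter plot of Y against X; pch = 19 selects filled circles
plot(X, Y, xlab = "X values", ylab = "Y values", pch = 19)

# Add a main title to the plot that was just drawn
title(main = "Y versus X")
```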

A handy trick is to add a smoothed line to better display this information:
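One possible way to do this, shown here as a sketch on made-up data, is the base R function lowess(), which computes a locally weighted smoother; the choice of lowess() is our assumption, since the original command is not shown.

```r
# Hypothetical data (assumed for illustration only)
X <- c(1, 2, 3, 4, 5, 6, 7, 8)
Y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

plot(X, Y, main = "Y versus X with a smoothed line")

# lowess() returns a list with components x and y describing the smooth curve
smooth_fit <- lowess(X, Y)
lines(smooth_fit, col = "blue", lwd = 2)
```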

In order to call a line graph, we will simply access the graphical parameters within our plot() command. The parameter “type” will allow us to switch from scatter to line, or line with points.
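The sketch below, again on made-up data, shows two common values of the type parameter.

```r
# Hypothetical data (assumed for illustration only)
X <- c(1, 2, 3, 4, 5, 6, 7, 8)
Y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

plot(X, Y, type = "l", main = "Line only")        # "l" draws lines
plot(X, Y, type = "b", main = "Line with points") # "b" draws both points and lines
```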

We can also add data labels by calling an additional line, using the text() command:
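A minimal sketch with made-up data: text() is called after plot() and places one label at each (X, Y) position. Using the Y values themselves as labels is our choice for illustration.

```r
# Hypothetical data (assumed for illustration only)
X <- c(1, 2, 3, 4, 5, 6, 7, 8)
Y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

# Leave some head room so labels above the top point are not clipped
plot(X, Y, type = "b", ylim = c(0, 18))

# pos: 1 = below, 2 = left, 3 = above, 4 = right of the point
text(X, Y, labels = Y, pos = 3)
```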

The “pos = ” addition allows us to set where we want said data labels to show up. For more graphical parameters, follow this link and explore:

https://www.statmethods.net/advgraphs/parameters.html

The Order Function



Several packages support making beautiful tables with R, such as

  • [xtable](https://cran.r-project.org/web/packages/xtable)
  • [stargazer](https://cran.r-project.org/web/packages/stargazer)
  • [pander](http://rapporter.github.io/pander/)
  • [tables](https://cran.r-project.org/web/packages/tables)
  • [ascii](http://eusebe.github.io/ascii/)

Another option for rendering tables is knitr. By default, R Markdown displays data frames and matrices as they would appear in the R terminal (in a monospaced font). Many packages make this easier.

Here is how you install them (make sure you have an internet connection):

mylibraries = c(
  "data.table",  # fast and compact data processing
  "stringr",     # work with character strings
  "ggplot2",     # comprehensive plotting environment
  "zoo",         # work with time series
  "lubridate",   # work with dates
  "DescTools",   # descriptive analytics
  "readxl"       # read Excel files
)

# install.packages(mylibraries, dependencies = TRUE)

If you run these commands, remove the “#” (comment) sign. Alternatively, you can install packages in RStudio via a menu: /Tools/Install packages/ (…type the names of the packages…).

Now that the packages are installed, you have to load them before you can work with them. Here is how you do that:

library(data.table)
library(stringr)
library(ggplot2)
library(zoo)
library(lubridate)
library(DescTools)

The table function is a very basic, but essential, function to master while performing interactive data analyses. It simply creates tabular results of categorical variables. However, when combined with the powers of logical expressions in R, you can gain even more insights into your data, including identifying potential problems. Suppose we want to know how many subjects are under the age of 60 in a clinical trial. The table function simply needs an object that can be interpreted as a categorical variable (called a “factor” in R).

If you want to know how many subjects are enrolled at each center, simply pass the variable “center” to the table function.

## a simple example of a table call
table(clinical.trial$center)
Center A Center B Center C Center D Center E
      22       10       28       23       17  

Suppose one needs a logical vector indicating whether or not a patient is under 60. We can create it and pass it into the table function. Also, since there are missing ages, we might be interested in seeing those in the table as well. Both variants are shown; the second sets the “useNA” argument of table.

## a logical vector is created and passed into table
table(clinical.trial$age < 60)
FALSE  TRUE
   41    39
## the useNA argument shows the missing values, too
table(clinical.trial$age < 60, useNA = "always")
FALSE  TRUE  <NA>
   41    39    20

Finding the center that has the most missing values for age sounds like the trickiest task, but it is once again extremely simple with the table function. You just need to know that the is.na function returns a logical vector that indicates whether each observation is missing or not.

## the table of missing age by center
table(clinical.trial$center, is.na(clinical.trial$age))
           FALSE TRUE
  Center A    16    6
  Center B     8    2
  Center C    23    5
  Center D    20    3
  Center E    13    4
## centers with most missing ages listed in order
## highest to lowest
sort(table(clinical.trial$center, is.na(clinical.trial$age))[, 2],
       decreasing = TRUE)
Center A Center C Center E Center D Center B
       6        5        4        3        2  

Although table is an extremely simple function, its use should not be avoided when exploring a dataset. These examples have shown you how to use table on variables in a dataset, and on variables created from a logical expression in R. The “useNA” argument was also introduced.
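The clinical.trial data set itself is not provided here. To try the calls above, one can build a stand-in data frame; the sketch below simulates one with the same column names (center, age), so its counts will differ from the output printed above.

```r
set.seed(1)  # make the simulated data reproducible
n <- 100

# Hypothetical stand-in for the clinical.trial data frame
clinical.trial <- data.frame(
  center = factor(sample(paste("Center", LETTERS[1:5]), n, replace = TRUE)),
  age    = round(rnorm(n, mean = 60, sd = 10))
)
clinical.trial$age[sample(n, 20)] <- NA  # blank out 20 ages at random

table(clinical.trial$center)                       # subjects per center
table(clinical.trial$age < 60, useNA = "always")   # under 60, with missing ages

# Missing ages by center, sorted from most to fewest
missing_by_center <- table(clinical.trial$center, is.na(clinical.trial$age))
sort(missing_by_center[, "TRUE"], decreasing = TRUE)
```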

Bar Charts and Column Charts

We can produce a histogram for a vector, showing the frequency of its values, via the hist() command.

Note: All hist() commands follow the same graphical parameter inputs as any plots, thus allowing you to change the color of the blocks, the title and axes titles, add a legend, etc.
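As a sketch, the following draws a histogram of simulated values (the data are made up) and sets a few of the graphical parameters mentioned in the note.

```r
set.seed(42)
values <- rnorm(200, mean = 50, sd = 10)  # hypothetical measurements

# hist() draws the plot and also returns the bin breaks and counts
h <- hist(values,
          breaks = 10,
          col    = "lightblue",
          main   = "Histogram of simulated values",
          xlab   = "Value")

h$counts  # frequency in each bin
```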

We can conditionally format the color of each bar (or any other parameter of the plot) using a simple ifelse statement:
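A minimal sketch of conditional coloring, assuming we want to highlight bars above a threshold. It reuses the per-center counts from the table example; the threshold of 20 and the color names are our choices for illustration.

```r
# Subjects per center, taken from the table() output shown earlier
counts <- c(A = 22, B = 10, C = 28, D = 23, E = 17)

# ifelse() is vectorized: one color per bar, chosen by the condition
bar_colors <- ifelse(counts > 20, "tomato", "grey70")

barplot(counts,
        col  = bar_colors,
        main = "Subjects per center",
        ylab = "Count")
```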

Bubble Charts

For Bubble Charts, we will use data provided alongside the “Plotly” R package. This package allows users to create local, interactive HTML widgets and plots such that they are dynamic in use. Our first example will be using R to show the Gender Gap in Earnings per University.

Note: when using packages we must first install the package, prior to use. To do this we will call the command:

Our next step is to open the data. To do this we will use the read.csv() command, which reads a CSV file into R from the internet or from your computer's files.

Now, we can open the library (install only has to happen once, but opening the library must happen each use.)

Now, we will run a script to create the graph. This call is slightly nuanced, but each parameter is listed on the plotly website (https://plot.ly/r/bubble-charts/)
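The steps above can be sketched as follows. To keep the example self-contained, a small made-up data frame stands in for the CSV file (in practice it would come from read.csv()); the column names School, Women, Men and Gap are assumptions. The plot is built only if the plotly package is installed.

```r
# install.packages("plotly")  # run once; remove the "#" to execute

# Hypothetical earnings data standing in for the file read with read.csv()
earnings <- data.frame(
  School = c("School A", "School B", "School C", "School D"),
  Women  = c(94, 96, 112, 78),
  Men    = c(152, 151, 165, 112)
)
earnings$Gap <- earnings$Men - earnings$Women  # bubble size

if (requireNamespace("plotly", quietly = TRUE)) {
  # One bubble per school; marker size is mapped to the earnings gap
  fig <- plotly::plot_ly(
    earnings, x = ~Women, y = ~Men, text = ~School,
    type = "scatter", mode = "markers",
    marker = list(size = ~Gap, opacity = 0.5)
  )
  fig <- plotly::layout(fig, title = "Gender Gap in Earnings per University")
  # Typing fig at the console displays the interactive chart
}
```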



The most important aspect of studying the distribution of a sample of measurements is locating the position of a central value about which the measurements are distributed. The arithmetic mean (average) of a set of n measurements \( X_1 , X_2 , \ldots , X_n \) is given by the formula

\[ \overline{X} = \frac{1}{n}\, \sum_{i=1}^n X_i = \frac{X_1 + X_2 + \cdots + X_n}{n} . \]

The mean provides a measure of central location for the data. If the data are for a sample (typically the case), the mean is denoted by \( \overline{X} . \) The sample mean is a point estimate of the (typically unknown) population mean for the variable of interest. If the data for the entire population are available, the population mean is computed in the same manner, but denoted either by \( E[X] , \) or by the Greek letter μ.

If the data are organized in a frequency distribution table, then we can calculate the mean by the formula

\[ \overline{X} = \frac{1}{n}\, \sum_{i=1}^k n_i X_i , \qquad n = \sum_{i=1}^k n_i , \]
where \( n_1 , n_2 , \ldots , n_k \) are the frequencies of the distinct values \( X_1 , X_2 , \ldots , X_k \) of the variable.
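As a small sketch with a made-up frequency table, the mean is the frequency-weighted sum divided by the total number of observations, and it agrees with mean() applied to the expanded raw data.

```r
x_vals <- c(1, 2, 3, 4)   # distinct values (hypothetical)
freqs  <- c(3, 5, 8, 4)   # their frequencies n_i (hypothetical)

n <- sum(freqs)                                    # total number of observations
mean_from_table <- sum(freqs * x_vals) / n         # formula for grouped data
mean_from_raw   <- mean(rep(x_vals, times = freqs))  # same data, written out

mean_from_table
weighted.mean(x_vals, freqs)  # built-in equivalent
```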

Elementary properties of the arithmetic mean:

  • the sum of deviations between the values and the mean is equal to zero:
    \[ \sum_{i=1}^n \left( X_i - \overline{X} \right) =0 ; \]
  • if the variable is constant then the mean is equal to this constant:
    \[ \frac{1}{n}\, \sum_{i=1}^n c = c; \]
  • if we add a constant c to the values of the variable, then
    \[ \frac{1}{n}\, \sum_{i=1}^n \left( X_i + c \right) = c + \overline{X} ; \]
  • if we multiply the values of the variable by a constant c, then
    \[ \frac{1}{n}\, \sum_{i=1}^n c\cdot X_i = c \cdot \overline{X} . \]
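These properties can be checked numerically. The sketch below uses an arbitrary data vector and constant chosen for illustration; all.equal() is used because floating-point sums are exact only up to rounding.

```r
x  <- c(2, 5, 7, 10, 12)  # arbitrary data
c0 <- 3                   # arbitrary constant

dev_sum    <- sum(x - mean(x))         # sum of deviations, zero up to rounding
const_mean <- mean(rep(c0, length(x))) # mean of a constant variable
shift_mean <- mean(x + c0)             # mean after adding a constant
scale_mean <- mean(c0 * x)             # mean after multiplying by a constant

all.equal(dev_sum, 0)
all.equal(const_mean, c0)
all.equal(shift_mean, mean(x) + c0)
all.equal(scale_mean, c0 * mean(x))
```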

The harmonic mean of a set of n measurements \( X_1 , X_2 , \ldots , X_n \) is defined by the formula

\[ \overline{X}_H = \frac{n}{\sum_{i=1}^n X_i^{-1}} . \]
In certain situations, especially many situations involving rates and ratios, the harmonic mean provides the truest average.
The geometric mean of a set of n measurements \( X_1 , X_2 , \ldots , X_n \) is defined by the formula

\[ \overline{X}_G = \left( X_1 \cdot X_2 \cdot \cdots \cdot X_n \right)^{1/n} = \sqrt[n]{X_1 \cdot X_2 \cdot \cdots \cdot X_n} . \]

The geometric mean may be more appropriate than the arithmetic mean for describing percentage growth.

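Suppose an apple tree yields 50 apples one year, then 60, 80 and 95 the following years, so the growth is 20 %, 33.33 % and 18.75 % for each of the years. Using the arithmetic mean, we calculate an average growth of about 24.03 % ((20 % + 33.33 % + 18.75 %) divided by 3). However, if we start with 50 apples and let the yield grow by 24.03 % for three years, the result is about 95.4 apples, not 95. The geometric mean of the growth factors, \( \left( 1.2 \cdot 1.3333 \cdot 1.1875 \right)^{1/3} = 1.9^{1/3} \approx 1.2386 , \) gives an average growth of about 23.86 % per year, which reproduces the 95 apples exactly.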
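As a sketch, the year-over-year growth factors can be computed directly from the yields 50, 60, 80 and 95, and the two averages compared by compounding each of them over three years.

```r
yields <- c(50, 60, 80, 95)

# Growth factor of each year relative to the previous one
factors <- yields[-1] / yields[-length(yields)]

arith <- mean(factors)                        # arithmetic mean of the factors
geom  <- prod(factors)^(1 / length(factors))  # geometric mean of the factors

yields[1] * arith^3  # compounding the arithmetic mean overshoots 95
yields[1] * geom^3   # compounding the geometric mean recovers 95
```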

Example: Calculate the arithmetic, harmonic and geometric mean of the first 10 Fibonacci numbers, \( F_{n+2} = F_{n+1} + F_n , \quad F_0 =0, \ F_1 =1 . \)
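A possible solution sketch follows. Since \( F_0 = 0 \) would make the geometric mean zero and the harmonic mean undefined, we assume that "the first 10 Fibonacci numbers" means \( F_1 , \ldots , F_{10} \); this reading is our assumption.

```r
# Build F_1, ..., F_10 from the recurrence F_{n+2} = F_{n+1} + F_n
fib <- numeric(10)
fib[1:2] <- 1
for (i in 3:10) fib[i] <- fib[i - 1] + fib[i - 2]

arithmetic <- mean(fib)                   # sum divided by n
harmonic   <- length(fib) / sum(1 / fib)  # n over the sum of reciprocals
geometric  <- exp(mean(log(fib)))         # n-th root of the product, via logs

c(arithmetic = arithmetic, harmonic = harmonic, geometric = geometric)
```

For positive data the three means always satisfy harmonic ≤ geometric ≤ arithmetic, which gives a quick sanity check on the output.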

The quantile \( x_p \) is the value of the variable such that 100p % of the values of the ordered sample (or population) are smaller than or equal to \( x_p \) and 100(1−p) % of the values of the ordered sample (or population) are larger than or equal to \( x_p . \)
The quantile is not uniquely defined.

There are three possible methods of calculating quantiles.

  1. Sort the data in ascending order. Find the sequential index \( i_p \) of the quantile \( x_p \) that satisfies the inequalities
    \[ n\, p < i_p < n\,p +1 . \]
    The quantile \( x_p \) is then equal to the value of the variable with the sequential index \( i_p : \) \( x_p = \langle x_{i_p} \rangle . \) If np is an integer (so that np and np+1 are both integers and no index lies strictly between them), we calculate the quantile as the arithmetic mean of \( \langle x_{np} \rangle \) and \( \langle x_{np+1} \rangle : \)
    \[ x_p = \frac{1}{2} \left( \langle x_{np} \rangle + \langle x_{np+1} \rangle \right) . \]
  2. According to MATLAB, we calculate
    \[ \overline{i_p} = \frac{np+np+1}{2} = \frac{2np+1}{2} , \]
    which determines the location of the quantile. Using linear interpolation we get
    \[ x_p = \langle x_{\lfloor \overline{i_p} \rfloor} \rangle + \left( \langle x_{\lfloor \overline{i_p} \rfloor +1} \rangle - \langle x_{\lfloor \overline{i_p} \rfloor} \rangle \right) \left( \overline{i_p} - \lfloor \overline{i_p} \rfloor \right) , \]
    where \( \lfloor \cdot \rfloor \) denotes the integer part of the number, called the floor. If \( \overline{i_p} < 1 , \) then \( x_p = \langle x_{1} \rangle ; \) if \( \overline{i_p} > n, \) then \( x_p = \langle x_{n} \rangle . \)
  3. According to Excel, we assign the values
    \[ 0, \frac{1}{n-1} , \frac{2}{n-1} , \ldots , \frac{n-2}{n-1} , 1 \]
    to the data sorted in ascending order. If p is a multiple of \( \frac{1}{n-1} , \) the quantile \( x_p \) is equal to the value corresponding to the given multiple. If p is not a multiple of \( \frac{1}{n-1} , \) linear interpolation is used.
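In R, the quantile() function offers several definitions through its type argument. As a sketch, and under our reading of the correspondence (worth checking against the R documentation), type = 2 matches the first method (averaging when np is an integer), type = 5 matches the MATLAB-style interpolation, and type = 7 (the default) matches the Excel-style interpolation. The data vector is chosen for illustration.

```r
x <- c(2, 5, 7, 10, 12, 13, 18, 21)
p <- c(0.10, 0.25, 0.50, 0.75, 0.90)

q_method1 <- quantile(x, probs = p, type = 2, names = FALSE)  # averaging rule
q_method2 <- quantile(x, probs = p, type = 5, names = FALSE)  # positions at np + 1/2
q_method3 <- quantile(x, probs = p, type = 7, names = FALSE)  # positions (i-1)/(n-1)

# One row per method, one column per probability
rbind(q_method1, q_method2, q_method3)
```

The three rows differ, which illustrates that the quantile is not uniquely defined.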

The n-th percentile of an observation variable is the value that cuts off the first n percent of the data values when it is sorted in ascending order.

The median of an observation variable is the value at the middle when the data is sorted in ascending order. It is an ordinal measure of the central location of the data values.

We apply the median function to compute the median value of eruptions.
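A minimal sketch, assuming that eruptions refers to the eruptions column of the built-in faithful data set (the page does not say where the variable comes from). The percentile levels in the last line are arbitrary.

```r
eruptions <- faithful$eruptions  # eruption durations in minutes

median(eruptions)                # value in the middle of the sorted data

# The median is the 50th percentile; other percentiles use the same function
quantile(eruptions, 0.50)
quantile(eruptions, c(0.32, 0.57, 0.98))
```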

The mode \( \hat{X} \) is the value of variable with the highest frequency. In the case of continuous variable (data) the mode is the value where the histogram reaches its peak.
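Base R has no built-in function for the statistical mode (the function named mode() reports the storage type of an object). A sketch for discrete data, using a small made-up data set, is to tabulate the values and pick the most frequent one.

```r
x <- c(1, 2, 5, 6, 7, 8, 8, 9)  # illustrative data

freq <- table(x)  # frequency of each distinct value

# Keep every value that attains the highest frequency (there may be ties)
mode_value <- as.numeric(names(freq)[freq == max(freq)])
mode_value
```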



Means, quantiles and a mode – measures of location – describe one property of frequency distribution – location. Another important property is dispersion (variation) which we describe by several measures of variation.

The range of variation R is defined as the difference between the largest and the smallest value of the variable

\[ R = X_{\max} - X_{\min} . \]
It is the simplest but also the crudest measure of variation. It indicates the width of the interval in which all values are included.

The interquartile range:

\[ R_Q = X_{0.75} - X_{0.25} . \]

The interdecile range:

\[ R_D = X_{0.90} - X_{0.10} . \]

The interpercentile range:

\[ R_C = X_{0.99} - X_{0.01} . \]

The interquartile range indicates the width of the interval which includes the middle 50 % of the values of the ordered sample. By analogy, the interdecile and the interpercentile ranges indicate the width of the interval which includes the middle 80 % and 98 % of the values of the ordered sample, respectively.

We have calculated quantiles of the data 2, 5, 7, 10, 12, 13, 18 and 21. We have the following values:
\( X_{0.10} = 2, \ X_{0.25} = 6, \ X_{0.50} = 11, \ X_{0.75} = 15.5, \ X_{0.90} = 21 . \)

The range of variation is \( R= X_{\max} - X_{\min} = 21 - 2 =19 . \)
The interquartile range is \( R_Q = X_{0.75} - X_{0.25} = 15.5 - 6 =9.5 . \)
The interdecile range is \( R_D = X_{0.90} - X_{0.10} = 21 - 2 =19 . \)

The quartile deviation is defined by the formula

\[ Q = R_Q /2 . \]
The decile deviation is defined by the following formula:
\[ D = R_D /8 . \]

The percentile deviation is defined by the formula

\[ C = R_C /98 . \]

Example: Calculate the quartile and the decile deviation of 2, 5, 7, 10, 12, 13, 18 and 21. The quartile deviation is
\( Q= R_Q /2 = 9.5/2 =4.75 . \)
The decile deviation is \( D= R_D /8 = 19/8 =2.375 . \)
It means that the average width of two (eight) middle quartile (decile) intervals is 4.75 (2.375).
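The ranges and deviations of this example can be reproduced in R. The sketch below assumes quantile type 2 (averaging at integer positions), which gives the quantile values used in the example.

```r
x <- c(2, 5, 7, 10, 12, 13, 18, 21)

# Quantiles X_0.10, X_0.25, X_0.75, X_0.90 with the averaging rule
q <- quantile(x, probs = c(0.10, 0.25, 0.75, 0.90), type = 2, names = FALSE)

R_range <- max(x) - min(x)  # range of variation
R_Q     <- q[3] - q[2]      # interquartile range
R_D     <- q[4] - q[1]      # interdecile range

Q_dev <- R_Q / 2            # quartile deviation
D_dev <- R_D / 8            # decile deviation

c(R = R_range, R_Q = R_Q, R_D = R_D, Q = Q_dev, D = D_dev)
```

The built-in IQR() function also accepts a type argument, so IQR(x, type = 2) gives the same interquartile range.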

The average deviation is defined as the arithmetic mean of the absolute deviations

\[ d_{\overline{X}} = \frac{1}{n} \, \sum_{i=1}^n \left\vert X_i - \overline{X} \right\vert . \]

Find the average deviation of a data set 1, 2, 5, 6, 7, 8, 8 and 9. Since the arithmetic mean is \( \overline{X} = 5.75 , \) we obtain

\begin{eqnarray*} d_{\overline{X}} &=& \frac{1}{8} \left[ |1 - 5.75| + |2- 5.75|+|5 - 5.75|+ |6 - 5.75| \right] + \\ && \frac{1}{8} \left[ |7 - 5.75| + |8-5.75| + |8-5.75| + |9-5.75| \right] = 2.3125 . \end{eqnarray*}
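The same calculation takes one line in R. Note that the built-in mad() function computes the median absolute deviation, which is a different quantity, so the average deviation is written out here.

```r
x <- c(1, 2, 5, 6, 7, 8, 8, 9)

# Mean of the absolute deviations from the arithmetic mean
avg_dev <- mean(abs(x - mean(x)))
avg_dev
```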

Variance

The variance \( s_n^2 \) is defined as the arithmetic mean of the squares of the deviations

\[ s_n^2 = \frac{1}{n} \, \sum_{i=1}^n \left\vert X_i - \overline{X} \right\vert^2 . \]
Expanding the sum above, we get
\begin{eqnarray*} s_n^2 &=& \frac{1}{n} \left( \sum_{i=1}^n X_i^2 - 2\,\overline{X} \,\sum_{i=1}^n X_i + \sum_{i=1}^n \overline{X}^2 \right) \\ &=& \frac{1}{n} \left( \sum_{i=1}^n X_i^2 - 2\,n\,\overline{X}^2 + n\,\overline{X}^2 \right) \\ &=& \frac{1}{n} \, \sum_{i=1}^n X_i^2 - \overline{X}^2 = \overline{X^2} -\overline{X}^2 . \end{eqnarray*}

Elementary properties of the variance:

  1. if the variable is constant, then the variance is zero;
  2. if we add a constant c to the values of the variable, then the variance does not change:
    \[ \frac{1}{n} \, \sum_{i=1}^n \left[ \left( X_i + c \right) - \left( \overline{X} + c \right) \right]^2 = s_n^2 ; \]
  3. if we multiply the values of the variable by a constant c, then
    \[ \frac{1}{n} \, \sum_{i=1}^n \left( c \cdot X_i - c \cdot \overline{X} \right)^2 = c^2 \cdot s_n^2 . \]
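These properties can be checked numerically. In the sketch below, pop_var() is a helper we define for the variance with divisor n (the built-in var() divides by n − 1), and the data and constants are arbitrary.

```r
# Variance with divisor n, as defined above (helper defined for this example)
pop_var <- function(v) mean((v - mean(v))^2)

x     <- c(1, 2, 5, 6, 7, 8, 8, 9)  # arbitrary data
shift <- 10                         # constant to add
k     <- 3                          # constant to multiply by

pop_var(rep(4, 8))                                 # constant variable: 0
all.equal(pop_var(x + shift), pop_var(x))          # adding a constant changes nothing
all.equal(pop_var(k * x), k^2 * pop_var(x))        # scaling multiplies by k^2
```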

The square root of the variance is called the standard deviation

\[ s_n = \sqrt{s_n^2} . \]

The sample variance \( s^2 \) is defined by the formula

\[ s^2 = \frac{1}{n-1} \, \sum_{i=1}^n \left( X_i - \overline{X} \right)^2 . \]
The square root of the sample variance is called the sample standard deviation
\[ s = \sqrt{s^2} . \]
It is obvious that
\[ s_n^2 = \frac{n-1}{n} \, s^2 . \]

Example: Calculate the variance, the standard deviation, the sample variance and the sample standard deviation of the data set 1, 2, 5, 6, 7, 8, 8 and 9.

The arithmetic mean is \( \overline{X} = 5.75 . \) So we have

\begin{eqnarray*} s_n^2 &=& \frac{1}{8} \left[ |1 - 5.75|^2 + |2- 5.75|^2 +|5 - 5.75|^2 + |6 - 5.75|^2 \right] + \\ && \frac{1}{8} \left[ |7 - 5.75|^2 + |8-5.75|^2 + |8-5.75|^2 + |9-5.75|^2 \right] = 7.4375 . \end{eqnarray*}
The variance can also be calculated by the formula \( s_n^2 = \overline{X^2} - \overline{X}^2 . \)
\begin{eqnarray*} \overline{X^2} &=& \frac{1}{n}\, \sum_{i=1}^n X_i^2 = \frac{1}{8} \left[ 1^2 + 2^2 + 5^2 + 6^2 + 7^2 + 8^2 + 8^2 + 9^2 \right] = 40.5 , \\ s_n^2 &=& \overline{X^2} - \overline{X}^2 = 40.5 - 5.75^2 = 7.4375 . \end{eqnarray*}
The standard deviation is
\[ s_n = \sqrt{s_n^2} = \sqrt{7.4375} \approx 2.72718 . \]
To get the sample variance we apply the formula
\[ s^2 = \frac{n}{n-1}\, s_n^2 = \frac{8}{7}\cdot 7.4375 = 8.5 . \]
The sample standard deviation is
\[ s = \sqrt{s^2} = \sqrt{8.5} \approx 2.91548 . \]
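The example can be verified in R. The built-in var() and sd() functions use the divisor n − 1, so they return the sample variance and the sample standard deviation; the versions with divisor n are computed from the definition.

```r
x <- c(1, 2, 5, 6, 7, 8, 8, 9)
n <- length(x)

s2 <- var(x)  # sample variance (divisor n - 1)
s  <- sd(x)   # sample standard deviation

sn2     <- mean((x - mean(x))^2)  # variance with divisor n
sn2_alt <- mean(x^2) - mean(x)^2  # same value via mean of squares minus squared mean
sn      <- sqrt(sn2)              # standard deviation with divisor n

c(sn2 = sn2, sn = sn, s2 = s2, s = s)
```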


A data dashboard is a data-visualization tool that illustrates multiple metrics and automatically updates these metrics as new data become available. It is like an automobile's dashboard instrumentation that provides information on the vehicle's current speed, fuel level, and engine temperature so that a driver can assess current operating conditions and take effective action. Similarly, a data dashboard provides the important metrics that managers need to quickly assess the performance of their organization and react accordingly.

We start with some basic definitions. Let X be a discrete random variable. Its r-th moment is defined by the formula

\[ m'_r = \frac{1}{n}\, \sum_{i=1}^n X_i^r . \]
The r-th central moment is defined by the formula
\[ m_r = \frac{1}{n}\, \sum_{i=1}^n \left( X_i - \overline{X} \right)^r , \]
where \( \overline{X} = m'_1 \) is the mean value of the n values of X.

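Moments can be calculated with R directly from these definitions, for example as mean(x^r) for the r-th moment and mean((x - mean(x))^r) for the r-th central moment of a data vector x.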

Another option is to use the function moment from the e1071 package. As it is not in the core R library, the package has to be installed and loaded into the R workspace.
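Both approaches are sketched below on a small made-up data vector. The helper names raw_moment and central_moment are ours; the e1071 call is executed only if that package is installed.

```r
x <- c(1, 2, 5, 6, 7, 8, 8, 9)  # illustrative data

# r-th moment and r-th central moment, written from the definitions
raw_moment     <- function(x, r) mean(x^r)
central_moment <- function(x, r) mean((x - mean(x))^r)

raw_moment(x, 1)      # the arithmetic mean
central_moment(x, 2)  # the variance with divisor n
central_moment(x, 3)  # third central moment

# install.packages("e1071")  # run once; remove the "#" to execute
if (requireNamespace("e1071", quietly = TRUE)) {
  # center = TRUE requests the central moment of the given order
  print(e1071::moment(x, order = 3, center = TRUE))
}
```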

The sample skewness is defined by the formula

\[ a_3 = \frac{m_3}{m_2^{3/2}} = \frac{1}{n\,s_n^3} \, \sum_{i=1}^n \left( X_i - \overline{X} \right)^3 . \]
The skewness of a data population is defined by the following formula, where μ2 and μ3 are the second and third central moments.
\[ \gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} . \]
Intuitively, the skewness is a measure of symmetry. As a rule, negative skewness indicates that the mean of the data values is less than the median, and the data distribution is left-skewed. Positive skewness would indicate that the mean of the data values is larger than the median, and the data distribution is right-skewed.

To calculate the skewness coefficient (of eruptions) one needs the function skewness from the e1071 package. As the package is not in the core R library, it has to be installed and loaded into the R workspace.
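A minimal sketch, assuming that eruptions is the eruptions column of the built-in faithful data set. The skewness is first computed from the central moments; the e1071 call is executed only if the package is installed, and type = 1 is chosen there because it corresponds to the moment ratio \( m_3 / m_2^{3/2} \) (the package default uses a slightly different normalization).

```r
eruptions <- faithful$eruptions

# Sample skewness from the second and third central moments
m2 <- mean((eruptions - mean(eruptions))^2)
m3 <- mean((eruptions - mean(eruptions))^3)
a3 <- m3 / m2^(3 / 2)
a3

# install.packages("e1071")  # run once; remove the "#" to execute
if (requireNamespace("e1071", quietly = TRUE)) {
  print(e1071::skewness(eruptions, type = 1))
}
```

A negative value is consistent with the mean of eruptions being smaller than its median.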

The kurtosis of a univariate population is defined by the following formula, ... moments. Intuitively, the kurtosis describes the tail shape of the data distribution. The normal distribution has zero kurtosis and thus the standard tail shape. It is said to be mesokurtic. ...
The sample kurtosis is defined by the formula

\[ a_4 = \frac{m_4}{m_2^{2}} -3 = \frac{1}{n\,s_n^4} \, \sum_{i=1}^n \left( X_i - \overline{X} \right)^4 - 3 . \]

Note that the Excel functions SKEW and KURT calculate skewness and kurtosis by the formulas

\begin{eqnarray*} a_3^{\ast} &=& \frac{n}{(n-1)(n-2)} \,\sum_{i=1}^n \left( \frac{X_i - \overline{X}}{s} \right)^3 , \\ a_4^{\ast} &=& \frac{n(n+1)}{(n-1)(n-2)(n-3)} \,\sum_{i=1}^n \left( \frac{X_i - \overline{X}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)} . \end{eqnarray*}
We can relate them to ours:
\begin{eqnarray*} a_3 &=& \frac{n-2}{\sqrt{n(n-1)}} \, a_3^{\ast} , \\ a_4 &=& \frac{(n-2)(n-3)}{n^2 -1} \,a_4^{\ast} - \frac{6}{n+1} . \end{eqnarray*}
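The relations can be checked numerically. The sketch below uses a small made-up data vector and assumes the Excel-style formulas in their usual form, namely with the factor \( \sqrt{n(n-1)} \) linking the two skewness coefficients and with the term \( 3(n-1)^2 / ((n-2)(n-3)) \) subtracted in the kurtosis.

```r
x <- c(1, 2, 5, 6, 7, 8, 8, 9)  # illustrative data
n <- length(x)
d <- x - mean(x)                # deviations from the mean

# Moment-based coefficients a_3 and a_4
m2 <- mean(d^2); m3 <- mean(d^3); m4 <- mean(d^4)
a3 <- m3 / m2^(3 / 2)
a4 <- m4 / m2^2 - 3

# Excel-style coefficients, based on the sample standard deviation s
s <- sd(x)
a3_star <- n / ((n - 1) * (n - 2)) * sum((d / s)^3)
a4_star <- n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum((d / s)^4) -
  3 * (n - 1)^2 / ((n - 2) * (n - 3))

# Both comparisons should report TRUE
all.equal(a3, (n - 2) / sqrt(n * (n - 1)) * a3_star)
all.equal(a4, (n - 2) * (n - 3) / (n^2 - 1) * a4_star - 6 / (n + 1))
```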