
An R TUTORIAL for Statistics Applications

Part 1 - Section 2: Creating Distributions

This chapter covers basic information regarding data visualisation using R.

Email Vladimir Dobrushkin

Several packages support making beautiful tables with R, such as:

  • [xtable](https://cran.r-project.org/web/packages/xtable)
  • [stargazer](https://cran.r-project.org/web/packages/stargazer)
  • [pander](http://rapporter.github.io/pander/)
  • [tables](https://cran.r-project.org/web/packages/tables)
  • [ascii](http://eusebe.github.io/ascii/)

Another option for displaying tables is knitr. By default, R Markdown displays data frames and matrices as they would appear in the R terminal (in a monospaced font). There are also many packages that make everyday data work easier.

Here is how you install them (make sure you have an internet connection):

mylibraries = c(
  "data.table",  # fast and compact data processing
  "stringr",     # work with character strings
  "ggplot2",     # comprehensive plotting environment
  "zoo",         # work with time series
  "lubridate",   # work with dates
  "DescTools",   # descriptive analytics
  "readxl"       # read Excel files
)

# install.packages(mylibraries, dependencies = TRUE)

To run this command, remove the “#” (comment) sign. Alternatively, you can install packages in RStudio via the menu: /Tools/Install packages/ (…type the names of the packages…).

Now that the packages are installed, you have to load them before you can work with them. Here is how you do that:

library(data.table)
library(stringr)
library(ggplot2)
library(zoo)
library(lubridate)
library(DescTools)
library(readxl)

The table function is a very basic, but essential, function to master while performing interactive data analyses. It simply creates tabular results of categorical variables. However, when combined with the powers of logical expressions in R, you can gain even more insights into your data, including identifying potential problems. Suppose we want to know how many subjects are under the age of 60 in a clinical trial. The table function simply needs an object that can be interpreted as a categorical variable (called a “factor” in R).
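The examples that follow refer to a data frame named clinical.trial, which is not constructed in this section. As a minimal sketch, a stand-in with the same two columns can be simulated as below. The column names age and center come from the calls used later; the patient column, the random seed, and all values are assumptions made purely for illustration, so the counts you obtain will differ from the printed output.

```r
# Simulated stand-in for clinical.trial (hypothetical data, not the original)
set.seed(123)
n <- 100
centers <- paste("Center", LETTERS[1:5])

clinical.trial <- data.frame(
  patient = seq_len(n),                          # hypothetical ID column
  age     = round(rnorm(n, mean = 60, sd = 10)), # simulated ages
  center  = factor(sample(centers, n, replace = TRUE), levels = centers)
)

# Blank out 20 ages at random to mimic missing values
clinical.trial$age[sample(n, 20)] <- NA

str(clinical.trial)
```

With this object in place, every table call in this section can be run as written.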

If you want to know how many subjects are enrolled at each center, simply pass the variable “center” to the table function.

## a simple example of a table call
table(clinical.trial$center)
Center A Center B Center C Center D Center E
      22       10       28       23       17  

Next, we create a logical vector indicating whether each patient is under 60 and pass it to the table function. Since some ages are missing, we may also want to see those in the table. The result is shown both ways, without and with the “useNA” argument to table.

## a logical vector is created and passed into table
table(clinical.trial$age < 60)
FALSE  TRUE
   41    39
## the useNA argument shows the missing values, too
table(clinical.trial$age < 60, useNA = "always")
FALSE  TRUE  <NA>
   41    39    20

Finding the center that has the most missing values for age may sound trickier, but it is once again an extremely simple task with the table function. You just need to know that the is.na function returns a logical vector indicating whether each observation is missing.

## the table of missing age by center
table(clinical.trial$center, is.na(clinical.trial$age))
           FALSE TRUE
  Center A    16    6
  Center B     8    2
  Center C    23    5
  Center D    20    3
  Center E    13    4
## centers with most missing ages listed in order
## highest to lowest
sort(table(clinical.trial$center, is.na(clinical.trial$age))[, 2],
       decreasing = TRUE)
Center A Center C Center E Center D Center B
       6        5        4        3        2  

Although table is an extremely simple function, it should not be overlooked when exploring a dataset. These examples have shown how to use table on variables in a dataset and on variables created from a logical expression in R. The “useNA” argument was also introduced.

Bar Charts and Column Charts

The hist() command produces a histogram of a numeric vector, showing how frequently its values fall within each range (bin).

Note: hist() accepts the same graphical parameters as other base R plots, so you can change the color of the bars, the main title and axis titles, add a legend, etc.
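As a rough illustration of the points above, the following sketch draws a histogram of simulated values and sets a few common graphical parameters. The vector scores, its distribution, and the chosen colors and labels are all assumptions made for this example.

```r
# Simulated data (hypothetical): 100 scores centered around 70
set.seed(42)
scores <- rnorm(100, mean = 70, sd = 10)

# hist() draws the plot and invisibly returns the bin information
h <- hist(scores,
          col    = "lightblue",              # fill color of the bars
          border = "white",                  # outline color of the bars
          main   = "Distribution of simulated scores",
          xlab   = "Score",
          ylab   = "Frequency")

# A legend is added the same way as for any other base R plot
legend("topright", legend = "Simulated scores", fill = "lightblue")
```

The returned object h holds the bin boundaries (h$breaks), bin midpoints (h$mids), and counts (h$counts), which is handy when the bars need further customization.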

We can conditionally format the color of each bar (or any other parameter of the plot) using a simple ifelse statement:
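One possible way to do this, sketched under assumptions, is to compute the bins first and then derive one color per bar with ifelse(). The data, the cutoff (the sample mean), and the two colors are hypothetical choices for illustration.

```r
# Simulated data (hypothetical)
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)

# Compute the bins without drawing, so the bin midpoints are available
h <- hist(x, plot = FALSE)

# One color per bar: bins centered below the mean in blue, the rest in red
bar_cols <- ifelse(h$mids < mean(x), "steelblue", "tomato")

# Draw the precomputed histogram with the conditional colors
plot(h, col = bar_cols,
     main = "Bars colored relative to the mean",
     xlab = "Value")
```

Because ifelse() is vectorized, the same pattern works for any per-bar condition, for example highlighting only the tallest bin with h$counts == max(h$counts).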

Bubble Charts

For Bubble Charts, we will use data provided alongside the “Plotly” R package. This package lets users create interactive HTML widgets and plots locally. Our first example uses R to show the Gender Gap in Earnings per University.

Note: a package must be installed before it can be used. To do this we will call the command:
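A minimal sketch of the installation step is shown here. As with the earlier install command, the install line is commented out; the availability check that follows is an addition for illustration.

```r
# Run once; remove the "#" to execute (requires an internet connection)
# install.packages("plotly")

# Check whether the package is already available on this machine
plotly_available <- requireNamespace("plotly", quietly = TRUE)
plotly_available
```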

Our next step is to load the data. To do this we will use the read.csv() command, which reads a CSV file into R from the internet or from your computer.
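The following sketch shows read.csv() in action. So that it runs without an internet connection, it first writes a tiny CSV to a temporary file; the file contents, the column names School, Women, and Men, and all values are made up for illustration and are not the actual earnings data.

```r
# Write a small, hypothetical CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "School,Women,Men",
  "School A,90,140",
  "School B,75,120",
  "School C,110,135"
), csv_path)

# read.csv() accepts a local file path or a URL and returns a data frame
earnings <- read.csv(csv_path)

str(earnings)   # inspect column types
head(earnings)  # show the first rows
```

To read from the internet instead, pass the address of the CSV file as a character string in place of csv_path.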

Now we can load the library (installation only has to happen once, but the library must be loaded in every new session).

Now we will run a script to create the graph. This call has several parameters, each of which is documented on the plotly website (https://plot.ly/r/bubble-charts/).
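A bubble chart call might look like the sketch below. It is not necessarily the exact script used for the original figure: the data frame is a small hypothetical stand-in with made-up values, and the column names School, Women, Men, and Gap are assumptions. The chart is only built if the plotly package is installed.

```r
# Hypothetical earnings data (made-up values)
earnings <- data.frame(
  School = c("School A", "School B", "School C", "School D"),
  Women  = c(90, 75, 110, 82),
  Men    = c(140, 120, 135, 105)
)
earnings$Gap <- earnings$Men - earnings$Women

if (requireNamespace("plotly", quietly = TRUE)) {
  library(plotly)

  # A bubble chart is a scatter plot whose marker size encodes a third variable
  fig <- plot_ly(
    earnings,
    x = ~Women, y = ~Men,
    text = ~School,                          # shown when hovering over a bubble
    type = "scatter", mode = "markers",
    marker = list(size = ~Gap, opacity = 0.5)
  )

  # Add a title and axis labels
  fig <- layout(fig,
                title = "Gender Gap in Earnings per University",
                xaxis = list(title = "Women"),
                yaxis = list(title = "Men"))
  fig
}
```

Here the bubble size is driven by the Gap column, so universities with a larger difference between the two earnings figures appear as larger circles.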