An R TUTORIAL for Statistics Applications:

Email Vladimir Dobrushkin

The purpose of this chapter is to provide an introduction to data mining methods for predicting an outcome based on a set of input variables, or features. These methods are also referred to as supervised learning.

Section 1: Data Sampling

A time series is a sequence of observations on a variable measured at successive points in time or over successive periods of time. We need a formal definition.

A stochastic process \( \{ X(t); \quad t\in T \} \) is a collection of random variables, where T is an index set, which represents time. If T takes on a continuous range of values (e.g., \( T = (- \infty , \infty ) \quad\mbox{or}\quad T= (0,\infty ) , \) the process is said to be a continuous parameter process. If, on other hand, T takes on a discrete set of values (e.g., \( T = \{ 0, 1, 2, \ldots \} \quad\mbox{or}\quad T = \{ 0, \pm 1, \pm 2, \ldots ) , \) the process is said to be a discrete parameter process.

Statisticians often have to take samples of data and then calculate statistics. Taking a sample is easy with R because a sample is really nothing more than a subset of data. To do so, you make use of sample(), which takes a vector as input; then you tell it how many samples to draw from that list.

Forecasting is required in many situations: deciding whether to build another building at the university campus in the next five years requires forecasts of future enrollment of students; scheduling staff in a call center next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (for case of capital investments), or only a few minutes beforehand (for telecommunication routing).

Say you wanted to simulate rolls of a die, and you want to get fifteen results. Because the outcome of a single roll of a die is a number between one and six, your code looks like this:

If you provide a seed value, the random-number sequence will be reset to a known state. This is because R doesn’t create truly random numbers, but only pseudo-random numbers. A pseudo-random sequence is a set of numbers that, for all practical purposes, seem to be random but were generated by an algorithm. When you set a starting seed for a pseudo-random process, R always returns the same pseudo-random sequence.

You can use sample() to take samples from the data frame iris. In this case, you may want to use the argument replace=FALSE. Because this is the default value of the replace argument, you don’t need to write it explicitly:

R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations. The following code snippets demonstrate ways to keep or delete variables and observations and to take random samples from a dataset.

We use sample( ) function to take a random sample of size n from a dataset.

The function createDataPartition of the caret package is designed for random sampling while balancing the distribution of the critical variable.

We can visualize sampling:

A common example in business analytics data is to take a random sample of a very large dataset, to test your analytics code. Note most business analytics datasets are data.frame (records as rows and variables as columns) in structure or database bound. This is partly due to a legacy of traditional analytics software.

Here is how we do it in R:

Referring to parts of data.frame rather than whole dataset. Using square brackets to reference variable columns and rows: dataset[i,k] refers to element in the i-th row and j-th column. The notation dataset[i,] refers to all elements in the i-th row or a record for a data.frame. The notation dataset[,j] refers to all elements in the j-th column- or a variable for a data.frame.

When dealing with large volumes of data, best practice is to extract a representative sample. A sample is representative if the analyst can make the same conclusions from it as from the entire population of data. There are no definite rules to determine the size of a sample. The sample of data must be large enough to contain significant information, yet small enough to manipulate quickly.

Textbook Examples sampling of population by Paul S. Levy and Stanley Lemeshow

Example Let us take data from Levy's book using the read.dta function from the foreign package:

In the R language, individual data sets are stored as data.frame objects, allowing users to load as many tables into working memory as necessary for the analysis. After loading the mydata table into memory, R functions can be run directly on this data table.

View a simple tabulation of your variable of interest. This table function accesses the momsag column (variable) stored inside the mydata data.frame object.

When you are working with a big dataset, you typically don't really need to work with all of it all the time. Particularly at the start of your project, while you are experimenting wildly with what you want to do, you can often iterate more quickly by working on a smaller subset of the data. Upon installing the package sparklyr, its command sdf_sample() provides a convenient way to do this. It takes a tibble, and the fraction of rows to return. In this case, you want to sample without replacement. To get a random sample of one tenth of your dataset, you would use the following code.

The pps package consists of several functions for selecting a sample from a finite population in such a way that the probability that a unit is selected is proportional to its size, hence the name pps (a more precise explanation of its acronym is "probability proportional to size"). The package also includes a function to select a stratified simple random sample (non pps) as well as a few utility functions.

The examples are illustrated using real data. The data set calif is based on data from the 2000 U.S. census, available at the U.S. Census Bureau’s web site, http://www.census.gov. The file includes one record for each place in California. A place can be anything from a tiny town to Los Angeles. The file includes a county code, and a stratum code has been added. The four strata were derived according to county size: counties with fewer than 50,000 people are in stratum 1 (as are the places in those counties), with 50,000 to 300,000 people in stratum 2, with 300,000 to 1,000,000 people in stratum 3, and the remaining ones are in stratum 4. The other variables in the file are total population (variable name population), and the number of whites (white), American Indians (amerind) and Hispanics (hispanic). A second data set, califcty, is derived from calif. It is the county-level summary of the bigger file, i.e., each record in califcty corresponds to a county. Throughout this document, we will use the total population as the measure of size of a unit (i.e., of a place or a county), where appropriate. The other variables (white, amerind, hispanic) play the role of the “variables of interest”, but are not used in any of the illustrations below.

The rest of this document assumes that the pps package and data have been loaded in an R session. If you installed the pps package the “R way”, then the functions and data can be loaded into your R session as follows.

We begin with a function that performs stratified simple random sampling. The function stratsrs takes a vector of stratum indicators and a vector of sample sizes (one sample size per stratum) as input. The stratum indicators must be sorted, or at least grouped. For example, the vectors 1,1,1,1,2,2,2,3,3,3,3,3,8,8,8,8 and 2,2,2,1,1,1,1,3,3,3,3,3,8,8,8,8 are both acceptable. These vectors correspond to four strata. Stratum 1 has four units, stratum 2 has three units, and so on. Note that there is no stratum 4, 5, 6 or 7, i.e., the stratum indicators are just labels. Thus they need not be numbers. An example of a vector of sample sizes is 2,3,1,2, which corresponds to selecting 2 units from stratum 1, 3 from stratum 2, 1 from stratum 3 and 2 from stratum 8. An example of the use of stratsrs is the following.

This returns the indices of the units in the sample. It is often convenient to keep all the population data, including stratification variables, in a data frame. If the population data are kept in the data frame pop, then a frame containing just the units in the sample is obtained using

where the data frame pop includes a stratum indicator variable called strat. This has the advantage that samplesubframe retains all the variables from the population data frame.

To illustrate a typical use of the stratified simple random sampling function with real data, here is an example using the California census data. We treat the data as a population of places, which have been placed in four strata, from which we would like to select a stratified random sample. For this example, we ignore the county indicator. First, let’s select the sample.

Now, let’s compare the population mean and sample mean by stratum for the first variable, which gives the total population for each place.

Better still, let’s get summaries for the four population variables for both the sample and the population.

The function ppswr selects a sample of units with replacement, with the probability of selection of each unit proportional to its size. The function call is ppswr(sizes,n), where sizes is a vector of the sizes of the population units and n is the sample size. We will use the California data to illustrate this function as well. For simplicity, we will select a sample of five units from the first county only.

The function ppss selects a sample of units using pps systematic sampling. Under this approach, a random starting point is chosen, and it determines the units that fall in the sample. It has the property that if a unit in the population is very large, then it is selected with certainty and can even be selected more than once. The function sizesok can be used to check for this situation. We will continue to use the single county used in the ppswr example to illustrate the ppss function.

To check whether any units are so big that they will be selected with certainty, we do the following.

Note that the function returns the number of “bad” units, which may be zero.

The function ppssstrat (note the three ses in a row) selects a pps sample in each stratum by simply applying the ppss function to each stratum in turn. The input file must be sorted by stratum (or at least grouped by stratum). To illustrate this function, we will use the county-level aggregation of the California data as our population. The 57 counties are grouped into four size strata. We will select 5, 4, 3 and 2 units in the four strata, respectively, with 5 corresponding to the sample size in the stratum of very small coutries, and so on.

The function sampford select a pps sample without replacement using Sampford’s method. The function sampfordpi computes the corresponding inclusion π_i and joint inclusion π_ij probabilities. To illustrate these functions, we will use the data for the first county only, as we did for ppswr. Unlike the previous functions, the Sampford ones incorporate a check for units that are too big and aborts if such units are found. First, we illustrate the use of sampford to select a sample.

The sampford function calls the ppswr and pps1 functions, as well as the sizesok function. The sampfordpi function returns a matrix with the inclusion probabilities π_i along the diagonal and the joint inclusion probabilities π_ij as off-diagonal elements. Of course, π_ij = π_ji.

Section 2: Data Partition

Data-mining applications deal with an abundance of data that requires simplification and analysis to make accurate predictions. However, the wealth of data can tempt the analyst the overfit the model. Model overfitting occurs when the analyst builds a model that does a great job of explaining the sample of data on which it is based, but fails to accurately predict outside the sample data.

The training set consists of the data used to build the candidate models.

The validation set is used to identify a "best" model through either comparison with other models or the turning of model parameters, then estimate the model accuracy.

In analyzing large data sets, a typical operation is to randomly partitioning the data set into subsets.

With caret package, when creating data partition 75% training and 25% test, we use:

inTrain would take approximately 75% rows from each species, which can be verified by issuing the following command:

Section 3: Accuracy Measures

There are different accuracy measures for methods classifying categorical outcomes than for methods estimating continuous outcomes.

Classification of Categorical Outcomes

There are many ways to measure how well a statistical model predicts a binary outcome. Three very common measures are accuracy, sensitivity, and specificity. Accuracy is one of those rare terms in statistics that means just what we think it does, but sensitivity and specificity are a little more complicated. To understand all three, first we have to consider the situation of predicting a binary outcome.

The basic situation is this: for each trial, there is only one true outcome: a Positive (or 0) or a Negative (or 1). For example, a bank assumes a Positive when there is a stolen credit card. That’s what the bank is on the lookout for. And of course we need the model to predict the outcome better than randomly guessing. Imagine if your credit card transactions were randomly declined for fraud. You would stop using the card.

So we need a little table, called a classification confusion matrix of all the possible situations:

Test Indicator

Outcome

	No	Yes
No	a True Negative	b False Positive
Yes	c False Negative	d True Positive

The Test Indicator is whatever process we’re using to predict whether each individual is a Yes or a No on the outcome we’re interested in. Does the bank’s model predict a thief? The Outcome is what actually eventually happens. Was the card really stolen?

For some decisions, those in box a, the model correctly predicted a no. All good. This is a true negative. The customer is using their own card and the bank believes it. The transaction goes through and the family happily snacks on dried mango on the drive home.

Box b is the count of those who were predicted to be Yeses but were actual Nos. Not so good. False positive. The customer isn’t happy that their real grocery shopping gets declined. Only some quick followup on the phone saves them.

Box c is the opposite of b. Those who were predicted to be Nos but were actual Yeses. This one is not so good either. False negative. The thief gets away with a load of risotto and dark chocolate covered almonds.

And finally some transactions were predicted to be Yeses and truly were Yeses. These individuals are all in box d. The thief is shut down. Justice is served.

Accuracy of Models

A perfectly accurate model would put every transaction into boxes a and d. Thieves are stopped but customers are not. A model that is so bad it’s worthless would have a lot of b’s (angry customers without groceries) and c’s (happy thieves with groceries) and possibly both.

One simple way of measuring Accuracy is simply the proportion of individuals who were correctly classified–the proportions of True Positives and True Negatives. This is helpful for sure, but sometimes it matters whether we’re correctly getting a Positive or a Negative correct. It may be worth annoying a few customers to make sure no thieves get away.

Another issue is we can generally increase one simply by decreasing the other. This may have important implications but the overall Accuracy rate won’t change. Or worse, we could improve overall Accuracy just by making the test more able to find the more common category.

So a better approach is to look at the accuracy for Positives and Negatives separately. These two values are called Sensitivity and Specificity. Sensitivity = d/(c+d): The proportion of observed positives that were predicted to be positive. In other words, of all the transactions that were truly fraudulent, what percentage did we find?

Specificity = a/(a+b): The proportion of observed negatives that were predicted to be negatives. In other words, of all the transactions that were legitimate, what percentage did we predict to be so? Ideally, the test will result in both being high, but usually there is a tradeoff. Every test needs to pick a threshold for how high a probability of fraud has to be before we call it a fraud.

Lowering that threshold to increase Sensitivity will decrease Specificity and vice versa. It’s important to understand this as you’re choosing that threshold and evaluating a model.

Tests for Binary Outcomes

Chi-square test: compares proportions between more than two groups

Relative risks: odds ratios or risk ratios (for 2x2 tables)

Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios.

McNemar’s chi-square test: compares binary outcome between correlated groups (e.g., before and after)

Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)

GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Fisher’s exact test: compares proportions between independent groups when there are sparse data (some cells <5).

McNemar’s exact test: compares proportions between correlated groups when there are sparse data (some cells <5).

Example: we will look at the “Elecsales” database in R.

Section 4: Logistic Regression

\[ \hat{y}_{t+1} = \alpha\,y_t + \left( 1- \alpha \right) \hat{y}_t = \hat{y}_t + \alpha \left( y_t - \hat{y}_t \right) , \] where

ŷ_t+1 = forecast of the time series for period t+1
y_t = actual value of the time series in period t
ŷ_t = forecast of the time series for period t
α = smoothing constant

We use the Holt--Winters Method as the most common method of exponential smoothing for data that experiences both seasonal and trend influences. For an example, we will look at our “beer” dataset from earlier.

Once we have created the HoltWinters object, we can now plot the object to view both the original data, and the smoothed forecast.

Section 5: k-Nearest Neighbors

The k-Nearest Neighbors (k-NN) method can be used either to classify an outcome category or estimate a continuous outcome.

An R TUTORIAL for Statistics Applications