
An R TUTORIAL for Statistics Applications

Clustering

The goal of clustering is to segment observations into similar groups based on the observed variables. Clustering can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration.


Clustering methods are used to identify groups of similar objects in multivariate data sets collected from fields such as marketing, biomedicine, and geospatial analysis. The exact definition of "similar" varies among algorithms, but it has a generic basis. The methods of forming clusters also vary, but they follow a few general blueprints. There are several types of clustering methods, including:

  1. Partitioning methods
  2. Hierarchical clustering
  3. Fuzzy clustering
  4. Density-based clustering
  5. Model-based clustering
  6. Validation
R has a very good package, called cluster, that contains many detailed examples.

R is a statistical analysis language, and that means it is very good at data manipulation and data analysis. One key way to analyze data is through plotting, and R excels in this field. Given two vectors, x and y, of the same length, R will plot them against each other.

Similarity, Dissimilarity and Distance

Distance or similarity measures are essential for solving many pattern recognition problems, such as classification, clustering, and retrieval. Various distance/similarity measures applicable to comparing two probability density functions (pdf, in short) have been reviewed and categorized in both syntactic and semantic relationships. A correlation coefficient and a hierarchical clustering technique can be adopted to reveal similarities among the numerous distance/similarity measures.

From the scientific and mathematical point of view, distance is defined as a quantitative degree of how far apart two objects are. Synonyms for distance include dissimilarity. Those distance measures satisfying the metric properties are simply called metrics, while other, non-metric distance measures are occasionally called divergences. Synonyms for similarity include proximity, and similarity measures are often called similarity coefficients. The choice of distance/similarity measure depends on the measurement type or representation of the objects. The probability density function, or pdf in short, is one of the most popular pattern representations.

Similarity is a characterization of the ratio of the number of attributes two objects share in common compared to the total list of attributes between them. Objects which have everything in common are identical, and have a similarity of 1.0. Objects which have nothing in common have a similarity of 0.0. There is a large number of similarity indices proposed and employed, but the concepts are common to all.

Dissimilarity is the complement of similarity, and is a characterization of the number of attributes two objects have uniquely compared to the total list of attributes between them. In general, dissimilarity can be calculated as 1 - similarity.

Distance is a geometric conception of the proximity of objects in a high-dimensional space defined by measurements on the attributes. Recall that R calculates distances with the dist() function, which accepts "euclidean", "manhattan", or "binary" (among others) as the method; the vegan package provides vegdist(). Just as these indices and metrics influence ordination results, they also influence cluster analyses.

One can employ a broader range of distances or dissimilarity coefficients, including ones that ignore joint absences. The base dist() function provides a fairly narrow range of distances (euclidean, maximum, manhattan, canberra, binary, and minkowski), whereas the vegan library provides the vegdist() function with a much wider choice of ecological dissimilarities.

Table 1. Distances and Dissimilarities

| Distance, Dissimilarity, or Index | dist method | vegan method | Formula |
|---|---|---|---|
| Euclidean | X | X | \( \displaystyle \left( \sum_i \vert x_i - y_i \vert^2 \right)^{1/2} \) |
| Manhattan | X | X | \( \displaystyle \sum_i \vert x_i - y_i \vert \) |
| Lp | | | \( \displaystyle \left( \sum_i \vert x_i - y_i \vert^p \right)^{1/p} \) |
| Chebyshev | | | \( \displaystyle \max_i \vert x_i - y_i \vert \) |
| Sorensen | | | \( \displaystyle \dfrac{\sum_i \vert x_i - y_i \vert}{\sum_i ( x_i + y_i )} \) |
| Canberra | X | | \( \displaystyle \sum_i \,\dfrac{ \vert x_i - y_i \vert}{x_i + y_i } \) |
| Bray-Curtis | | bray | |
| Gower | | X | \( \displaystyle \frac{1}{N}\,\sum_i \vert x_i - y_i \vert \) |
| Kulczynski | | X | \( \displaystyle \dfrac{\sum_i \vert x_i - y_i \vert}{\sum_i \min ( x_i , y_i )} \) |
| Soergel | | | \( \displaystyle \dfrac{\sum_i \vert x_i - y_i \vert}{\sum_i \max ( x_i , y_i )} \) |
| Lorentzian | | | \( \displaystyle \sum_i \ln \left( 1 + \vert x_i - y_i \vert \right) \) |
| Ochiai | | | \( \displaystyle \sqrt{\frac{a}{a+b} \, \frac{a}{a+c}} \) |
| Jaccard | | | \( \displaystyle \frac{a}{a+b+c} = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i x_i y_i} \) |
| Russell-Rao | | | \( \displaystyle \frac{a}{N} \) |
| Kulczynski-2 | | | \( \displaystyle \frac{1}{2} \left( \frac{a}{a+b} + \frac{a}{a+c} \right) \) |
| Cosine | | | \( \displaystyle \sum_i x_i y_i \left( \sum_i x_i^2 \,\cdot\, \sum_i y_i^2 \right)^{-1/2} \) |
| Pseudo cosine | | | \( \displaystyle \sum_i x_i y_i \left( \sum_i x_i \,\cdot\, \sum_i y_i \right)^{-1} \) |
| Dice | | | \( \displaystyle \frac{2\,\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2} \) |
| Sorensen-Dice | | | \( \displaystyle \frac{2a}{2a+b+c} \) |
| Simple matching | | | \( \displaystyle \frac{a+d}{a+b+c+d} \) |
| Yule | | | \( \displaystyle \frac{ad-bc}{ad+bc} \) |
| Ruzicka | | | \( \displaystyle \frac{\sum_i \min (x_i , y_i )}{\sum_i \max (x_i , y_i )} \) |
| Roberts | | | |
| Chi-square | | | |
| Morisita | | X | |
| Mountford | | X | |
| Horn | | X | |
| Minkowski | X | | |

a = number of variables on which both objects i and j are 1
b = number of variables where object i is 1 and j is 0
c = number of variables where object i is 0 and j is 1
d = number of variables where both i and j are 0
a + b + c + d = N, the number of variables.

This table specifies the distance, dissimilarity, or index functions available in various programs and libraries. The dist and vegan columns indicate how the index is supplied as the method argument in the function call. The first column gives the name of the index as it is commonly known; an X means the index is available under that name, while another name means the index is available in the function but is called something slightly different. In all cases, the first argument to the function is the vegetation matrix (transformed appropriately beforehand if desired), and the second argument is the distance, dissimilarity, or index. Since R allows function arguments to be specified by position as well as by name, you don't have to remember whether a specific function uses "metric", "method", or "index" as long as it is the second argument supplied.

For example:

sampleA <- c(1,1,0,1,1,1,1)
sampleB <- c(0,1,1,0,0,1,1)
# Sorensen dissimilarity for these two presence/absence samples;
# vegdist() with binary = TRUE and method = "bray" returns the Sorensen index
library(vegan)
vegdist(rbind(sampleA, sampleB), method = "bray", binary = TRUE)
One of the oldest and best known occurrence measures is the Jaccard measure, also known as the Coefficient of Community (Jaccard 1901). The measure has seen extensive use, largely due to its simplicity and intuitiveness.
result <- dist(veg, method = "canberra")      # base R (stats package)
result <- dist(veg, "canberra")

result <- vegdist(veg, method = "bray")       # vegan package
result <- vegdist(veg, "bray")

result <- dsvdis(veg, index = "ruzicka")      # labdsv package
result <- dsvdis(veg, "ruzicka")
are all correct function calls for base R, vegan, and labdsv, respectively (here veg is a vegetation data matrix). My convention is to name the result "dis" with an extension that specifies which distance, dissimilarity, or index was used. For example:
dis.euc <- dist(veg,'euclidean')
dis.bray <- vegdist(veg,method="bray")
dis.ruz <- dsvdis(veg,index="ruzicka")
R uses a function called cmdscale() to calculate what it calls "classical multi-dimensional scaling", a synonym for principal coordinates analysis.

Although the concept of Euclidean distance has prevailed in different cultures and regions for millennia, it is not a panacea for all types of data or patterns to be compared. For instance, it often does not work well with vegetation data, and many ecologists suggest Manhattan distance as an alternative. There has been considerable effort to find appropriate measures among such a plethora of choices, because the choice is of fundamental importance to pattern classification, clustering, and information retrieval problems.

dis.mht <- dist(veg, "manhattan")
mht.pco <- pco(dis.mht, k = 10)    # pco() from the labdsv package (principal coordinates analysis)
plot(mht.pco)

The goal of cluster analysis is to group observations into clusters such that observations within a cluster are similar and observations in different clusters are dissimilar. Therefore, to formalize this process, we need explicit measurements of similarity or, conversely, dissimilarity. Some metrics track similarity between observations, and a clustering method using such a metric would seek to maximize the similarity between observations. Other metrics measure dissimilarity, or distance, between observations, and a clustering method using one of these metrics would seek to minimize the distance between observations in a cluster.

When observations include numerical variables, Euclidean distance is the most common method to measure dissimilarity between observations. Let observations \( {\bf u} = ( u_1 , u_2 , \ldots , u_n ) \) and \( {\bf v} =(v_1 , v_2 , \ldots , v_n ) \) each comprise measurements of n variables. The Euclidean distance between u and v is \[ d_2 ({\bf u}, {\bf v}) = \| {\bf u} - {\bf v} \|_2 = \sqrt{(u_1 - v_1 )^2 + (u_2 - v_2 )^2 + \cdots + (u_n - v_n )^2} . \] The distance between two points in a grid based on a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal (Euclidean) or "as the crow flies" distance, is called the Manhattan distance or the taxicab distance: \[ d_1 ({\bf u}, {\bf v}) = \| {\bf u} - {\bf v} \|_1 = \sum_{k=1}^n |u_k - v_k | . \] Other dissimilarity measures exist, such as correlation-based distances, which are widely used for gene expression data analyses. A correlation-based distance is defined by subtracting the correlation coefficient from 1. Different types of correlation methods can be used:
  1. Pearson correlation distance: \[ d_{\scriptstyle cor} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{\sum_{i=1}^n \left( x_i - \overline{\bf x} \right)\left( y_i - \overline{\bf y} \right)}{\sqrt{\sum_{i=1}^n \left( x_i - \overline{\bf x} \right)^2 \sum_{k=1}^n \left( y_k - \overline{\bf y} \right)^2} } . \] Pearson correlation measures the degree of a linear relationship between two profiles.
  2. Eisen cosine correlation distance (Eisen et al., 1998): \[ d_{\scriptstyle eisen} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{\left\vert \sum_{i=1}^n x_i y_i \right\vert}{\sqrt{\sum_{i=1}^n x_i^2 \sum_{k=1}^n y_k^2}} . \] It’s a special case of Pearson’s correlation with \( \overline{\bf x} \) and \( \overline{\bf y} \) both replaced by zero.
  3. Spearman correlation distance: \[ d_{\scriptstyle spear} \left( {\bf x} , {\bf y} \right) = 1 - \dfrac{\sum_{i=1}^n \left( x_i - \overline{\bf x}' \right)\left( y_i - \overline{\bf y}' \right)}{\sqrt{\sum_{i=1}^n \left( x_i - \overline{\bf x}' \right)^2 \sum_{k=1}^n \left( y_k - \overline{\bf y}' \right)^2} } , \] where \( \overline{\bf x}' = \mbox{rank}({\bf x}) \) and \( \overline{\bf y}' = \mbox{rank}({\bf y}) . \) The spearman correlation method computes the correlation between the rank of x and the rank of y variables.
  4. Kendall correlation distance: \[ d_{\scriptstyle kendall} \left( {\bf x} , {\bf y} \right) = 1 - \frac{n_c - n_d}{n(n-1)/2} , \] where
    • n_c: total number of concordant pairs;
    • n_d: total number of discordant pairs
    • n: size of x and y
    Kendall correlation measures the correspondence between the rankings of the x and y variables. The total number of possible pairings of x with y observations is n(n−1)/2, where n is the size of x and y. Begin by ordering the pairs by the x values. If x and y are correlated, they will have the same relative rank orders. Then, for each \( y_i \), count the number of \( y_j > y_i \) with \( j > i \) (concordant pairs, c) and the number of \( y_j < y_i \) with \( j > i \) (discordant pairs, d).

Pearson correlation analysis is the most commonly used method. It is also known as a parametric correlation which depends on the distribution of the data. Kendall and Spearman correlations are non-parametric and they are used to perform rank-based correlation analysis.
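As a sketch of how these correlation-based distances can be computed in base R (assuming a numeric matrix x whose rows are the observations to be compared):

x <- matrix(rnorm(40), nrow = 8)                           # toy data: 8 observations, 5 variables
d.pearson  <- as.dist(1 - cor(t(x), method = "pearson"))   # Pearson correlation distance
d.spearman <- as.dist(1 - cor(t(x), method = "spearman"))  # Spearman (rank) correlation distance
d.kendall  <- as.dist(1 - cor(t(x), method = "kendall"))   # Kendall correlation distance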

What type of distance measures should we choose?

The choice of distance measures is very important, as it has a strong influence on the clustering results. For most common clustering software, the default distance measure is the Euclidean distance.

Depending on the type of the data and the researcher questions, other dissimilarity measures might be preferred. For example, correlation-based distance is often used in gene expression data analysis.

Correlation-based distance considers two objects to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance. The distance between two objects is 0 when they are perfectly correlated. Pearson’s correlation is quite sensitive to outliers. This does not matter when clustering samples, because the correlation is over thousands of genes. When clustering genes, it is important to be aware of the possible impact of outliers. This can be mitigated by using Spearman’s correlation instead of Pearson’s correlation.

If we want to identify clusters of observations with the same overall profiles regardless of their magnitudes, then we should go with correlation-based distance as a dissimilarity measure. This is particularly the case in gene expression data analysis, where we might want to consider genes similar when they are “up” and “down” together. It is also the case in marketing, if we want to identify groups of shoppers with the same preferences in terms of items, regardless of the volume of items they bought.

If Euclidean distance is chosen, then observations with high values of features will be clustered together. The same holds true for observations with low values of features.

Data Standardization

The value of distance measures is intimately related to the scale on which measurements are made. Therefore, variables are often scaled (i.e. standardized) before measuring the inter-observation dissimilarities. This is particularly recommended when variables are measured in different scales (e.g: kilograms, kilometers, centimeters, …); otherwise, the dissimilarity measures obtained will be severely affected.

The goal is to make the variables comparable. Generally variables are scaled to have i) standard deviation one and ii) mean zero.

The standardization of data is an approach widely used in the context of gene expression data analysis before clustering. We might also want to scale the data when the mean and/or the standard deviation of variables are largely different.

When scaling variables, the data can be transformed as follows: \[ \frac{x_i - \mbox{center}({\bf x})}{\mbox{scale} ({\bf x})} \] where center(x) can be the mean or the median of the x values, and scale(x) can be the standard deviation (SD), the interquartile range, or the MAD (median absolute deviation).

The R base function scale() can be used to standardize the data. It takes a numeric matrix as an input and performs the scaling on the columns.
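For instance, a minimal sketch using the built-in USArrests data:

df <- na.omit(USArrests)       # demo data; drop missing values, if any
df.scaled <- scale(df)         # center each column to mean 0 and scale it to standard deviation 1
round(head(df.scaled, 3), 2)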

Standardization makes the four distance measures discussed above (Euclidean, Manhattan, Pearson correlation, and Eisen cosine correlation) more similar than they would be with untransformed data.

Note that, when the data are standardized, there is a functional relationship between the Pearson correlation coefficient \( r({\bf x}, {\bf y}) \) and the Euclidean distance.

With some algebra, the relationship can be written as \[ d_2 ({\bf x}, {\bf y}) = \sqrt{2m \left[ 1 - r({\bf x}, {\bf y}) \right]} , \] where x and y are two standardized m-vectors with zero mean and unit standard deviation. Therefore, results obtained with the Pearson correlation measure and with standardized Euclidean distances are comparable.

The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean or taxicab distance between observations. After converting to z-scores, unequal weighting of variables can also be considered by multiplying the variables of each observation by a selected set of weights. For instance, after standardizing the units on customer observations so that income and age are expressed as their respective z-scores (instead of in dollars and years), we can multiply the income z-scores by 2 if we wish to treat income with twice the importance of age. In other words, standardizing removes bias due to the difference in measurement units, and variable weighting allows the analyst to introduce appropriate bias based on the business context.

When clustering of observations is done solely on the basis of categorical variables encoded as 0-1 (dummy) variables, a better measure of similarity between two observations can be achieved by counting the number of variables with matching values. The simplest overlap measure is called the matching coefficient and is computed by \[ \frac{\mbox{number of variables with matching value for observations } {\bf u} \mbox{ and } {\bf v}}{\mbox{total number of variables}} \]

One weakness of the matching coefficient is that if two observations both have a 0 entry for a categorical variable, this is counted as a sign of similarity between the two observations. However, matching 0 entries do not necessarily imply similarity. For example, if the categorical variable is Own A Minivan, then a 0 entry in two different observations does not mean that these two people own the same type of car; it means only that neither owns a minivan. To avoid overstating similarity due to the shared absence of a feature, a similarity measure called Jaccard's coefficient does not count matching zero entries and is computed by \[ \frac{\mbox{number of variables with matching nonzero value for observations } {\bf u} \mbox{ and } {\bf v}}{\mbox{total number of variables} - (\mbox{number of variables with matching zero values for observations }{\bf u} \mbox{ and } {\bf v})} \]
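A small sketch of both coefficients for two hypothetical 0-1 observations u and v; note that dist() with method = "binary" returns 1 minus Jaccard's coefficient:

u <- c(1, 0, 1, 1, 0, 0)
v <- c(1, 0, 0, 1, 0, 1)
matching <- mean(u == v)                 # matching coefficient: proportion of matching entries
jaccard  <- sum(u & v) / sum(u | v)      # Jaccard's coefficient: ignores matching zeros
dist(rbind(u, v), method = "binary")     # asymmetric binary distance = 1 - Jaccard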

Distance matrix computation

Data preparation

We start with the USArrests data as a demo data set. Later we will use randomly generated data. For the USArrests data, we use only a subset of the data, taking 15 random rows among the 50 rows in the data set. This is done with the function sample(). Next, we standardize the data using the function scale():
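One possible version of this preparation step (a sketch; set.seed() is added so the random subset is reproducible):

set.seed(123)
ss <- sample(1:50, 15)         # take 15 random rows out of the 50 states
df <- USArrests[ss, ]
df.scaled <- scale(df)         # standardize the selected subset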

There are many R functions for computing distances between pairs of observations:
  1. dist() R base function [stats package]: Accepts only numeric data as an input.
  2. get_dist() function [factoextra package]: Accepts only numeric data as an input. Compared to the standard dist() function, it supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.
  3. daisy() function [cluster package]: Able to handle other variable types (e.g. nominal, ordinal, (a)symmetric binary). In that case, the Gower’s coefficient will be automatically used as the metric. It’s one of the most popular measures of proximity for mixed data types. For more details, read the R documentation of the daisy() function (?daisy).
To compute Euclidean distance, you can use the R base dist() function, as follows:
dist.eucl <- dist(df.scaled, method = "euclidean")
Note that allowed values for the option method include “euclidean”, “maximum”, “manhattan”, “canberra”, “binary”, and “minkowski”; the “maximum” method is the Chebyshev distance \[ \| x-y \|_{\infty} = \max_i |x_i - y_i | . \] To make it easier to see the distance information generated by the dist() function, you can reformat the distance vector into a matrix using the as.matrix() function.
In this matrix, each value represents the distance between two objects. The values on the diagonal of the matrix represent the distance between an object and itself (which is zero).
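For example, assuming dist.eucl computed above, a 3-by-3 corner of the reformatted matrix can be displayed like this:

round(as.matrix(dist.eucl)[1:3, 1:3], 1)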

Computing correlation based distances

Correlation-based distances are commonly used in gene expression data analysis.

The function get_dist()[factoextra package] can be used to compute correlation-based distances. Correlation method can be either pearson, spearman or kendall.
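A minimal sketch, assuming the standardized subset df.scaled from above:

library(factoextra)
dist.cor <- get_dist(df.scaled, method = "pearson")   # Pearson correlation distance
round(as.matrix(dist.cor)[1:3, 1:3], 2)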

Computing distances for mixed data

The function daisy() [cluster package] provides a solution (Gower’s metric) for computing the distance matrix in the situation where the data contain non-numeric columns.

The R code below applies the daisy() function on flower data which contains factor, ordered and numeric variables:
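One possible version of that code, using the flower data set shipped with the cluster package:

library(cluster)
data(flower)                    # mixed-type data: factor, ordered, and numeric columns
str(flower)
dd <- daisy(flower)             # Gower's metric is chosen automatically for mixed data
round(as.matrix(dd)[1:3, 1:3], 2)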

Visualizing distance matrices

A simple solution for visualizing the distance matrices is to use the function fviz_dist() [factoextra package]. Other specialized methods, such as agglomerative hierarchical clustering or heatmaps, will be comprehensively described in the dedicated chapters.

To use fviz_dist() type this:
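For instance, assuming the Euclidean distance object dist.eucl computed earlier:

library(factoextra)
fviz_dist(dist.eucl)            # heat-map style visualization of the distance matrix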

The color level is proportional to the value of the dissimilarity between observations: pure red if \( d(x_i , x_j ) = 0 \), and pure blue corresponds to the largest Euclidean distance computed. Objects belonging to the same cluster are displayed in consecutive order.

One can practice many R commands using randomly generated data with the aid of rnorm(n, mean, sd), which generates n normal random variates with the given mean and standard deviation (for multivariate normal data with a specified variance-covariance matrix, see mvrnorm() in the MASS package). We give an example of two sets of random data with different means.
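A sketch of such a simulation: two bivariate clouds of points, one centered at (0, 0) and one at (1, 1).

set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),   # 100 points around (0, 0)
           matrix(rnorm(200, mean = 1, sd = 0.3), ncol = 2))   # 100 points around (1, 1)
colnames(x) <- c("x", "y")
plot(x)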

We demonstrate usage of the clara() command and the silhouette plot.
One can directly extract silhouette information from a clustering with the command silhouette(x, dist, dmatrix, …). Here x is an object of an appropriate class; for the default method, an integer vector with k different integer cluster codes, or a list with such an x$clustering component. Note that silhouette statistics are only defined if \( 2 \le k \le n-1 . \)
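A minimal sketch, reusing the simulated matrix x from above:

library(cluster)
cl <- clara(x, k = 2)           # CLARA: k-medoids clustering for larger data sets
sil <- silhouette(cl)           # silhouette widths of the clara partition
plot(sil, main = "Silhouette plot, clara with k = 2")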

For each observation i, the silhouette width s(i) is defined as follows. Put a(i) = average dissimilarity between i and all other points of the cluster to which i belongs (if i is the only observation in its cluster, s(i) := 0 without further calculations). For all other clusters C, put d(i, C) = average dissimilarity of i to all observations of C. The smallest of these d(i, C) is \( b(i) := \min_C d(i,C) \), and can be seen as the dissimilarity between i and its “neighbor” cluster, i.e., the nearest cluster to which it does not belong. Finally,

\[ s(i) := \frac{b(i) -a(i)}{\max \left\{ a(i) , b(i) \right\}} . \] silhouette.default() is now based on C code donated by Romain Francois (the R version is still available as cluster:::silhouette.default.R). Observations with a large s(i) (close to 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster.
The function silhouette() in the package cluster can do the plots for you. It just needs a vector of cluster memberships (produced by whatever algorithm you choose) and a dissimilarity matrix (preferably the same one used to produce the clusters). For example:

library(cluster)
library(vegan)
data(xclara)                            # bivariate data with 3 well-separated clusters (cluster package)
dis <- vegdist(xclara)                  # Bray-Curtis dissimilarity (vegdist's default)
res <- pam(dis, 3)                      # or whatever your choice of clustering algorithm is
sil <- silhouette(res$clustering, dis)  # or use your own cluster vector
dev.new()                               # RStudio sometimes does not display silhouette plots correctly
plot(sil)

For k-means (which uses squared Euclidean distance):

library(cluster)
data(xclara)
dis <- dist(xclara)^2                   # squared Euclidean distances
res <- kmeans(xclara, 3)
sil <- silhouette(res$cluster, dis)
dev.new()
plot(sil)

Partitioning methods

In R, standard clustering methods (partitioning and hierarchical clustering) can be computed using the R packages stats and cluster. However, the workflow generally requires multiple steps and multiple lines of R code. Partitioning methods like pam, clara, and fanny require the number of clusters to be given by the user. It is recommended to use easy-to-use wrapper functions, such as those in the factoextra package, for enhanced cluster analysis and visualization. These functions include:
  1. get_dist() & fviz_dist() for computing and visualizing distance matrix between rows of a data matrix. Compared to the standard dist() function, get_dist() supports correlation-based distance measures including “pearson”, “kendall,” and “spearman” methods.
  2. eclust(): enhanced cluster analysis. It has several advantages:
    • It simplifies the workflow of clustering analysis
    • It can be used to compute hierarchical clustering and partitioning clustering in a single function call
    • Compared to the standard partitioning functions (kmeans, pam, clara and fanny), which require the user to specify the optimal number of clusters, the function eclust() automatically computes the gap statistic for estimating the right number of clusters.
    • For hierarchical clustering, correlation-based metric is allowed
    • It provides silhouette information for all partitioning methods and hierarchical clustering
    • It creates beautiful graphs using ggplot2
We start by presenting required R packages and data format for cluster analysis and visualization. To install packages, type in R:

install.packages("factoextra")
install.packages("cluster")
install.packages("magrittr")
To load packages:

library("cluster")
library("factoextra")
Data preparation
The built-in R dataset USArrests is used:
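A sketch of this preparation step:

df <- USArrests          # built-in data
df <- na.omit(df)        # remove missing values, if any
df <- scale(df)          # standardize the variables
head(df, n = 3)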

Distance matrix computation and visualization
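For example, assuming the standardized matrix df from the preparation step above:

res.dist <- get_dist(df, method = "pearson")
fviz_dist(res.dist,
          gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))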


Enhanced clustering analysis

The standard R code for computing hierarchical clustering looks like this:
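A sketch of that standard workflow, assuming the standardized matrix df from above:

res.dist <- dist(df, method = "euclidean")        # 1. compute a dissimilarity matrix
res.hc   <- hclust(res.dist, method = "ward.D2")  # 2. hierarchical clustering
grp      <- cutree(res.hc, k = 4)                 # 3. cut the tree into 4 groups
plot(res.hc, cex = 0.6)                           # 4. plot the dendrogram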

The format of the eclust() function [factoextra package] to simplify the workflow is as follows:
eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...)
In the following R code, we’ll show some examples for enhanced k-means clustering and hierarchical clustering. Note that the same analysis can be done for PAM, CLARA, FANNY, AGNES and DIANA.
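A possible set of examples (a sketch, assuming df as prepared above):

library(factoextra)
res.km <- eclust(df, "kmeans", nstart = 25)   # enhanced k-means; k is chosen via the gap statistic
fviz_gap_stat(res.km$gap_stat)                # gap statistic plot
fviz_silhouette(res.km)                       # silhouette plot

res.hc <- eclust(df, "hclust", hc_metric = "pearson", hc_method = "ward.D2")
fviz_dend(res.hc, rect = TRUE)                # dendrogram with cluster rectangles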

It’s also possible to specify the number of clusters as follows:
eclust(df, "kmeans", k = 4)

Distance measures

It’s simple to compute and visualize distance matrix using the functions get_dist() and fviz_dist() [factoextra R package]:
  • get_dist(): for computing a distance matrix between the rows of a data matrix. Compared to the standard dist() function, it supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.
  • fviz_dist(): for visualizing a distance matrix
Partitioning algorithms are clustering techniques that subdivide the data sets into a set of k groups, where k is the number of groups pre-specified by the analyst.

There are different types of partitioning clustering methods. The most popular is K-means clustering (MacQueen 1967), in which each cluster is represented by the center, or mean, of the data points belonging to the cluster. The K-means method is sensitive to outliers.

An alternative to k-means clustering is k-medoids clustering, or PAM (Partitioning Around Medoids; Kaufman & Rousseeuw, 1990), which is less sensitive to outliers than k-means.

The following R code shows how to determine the optimal number of clusters and how to compute k-means and PAM clustering in R.

Determining the optimal number of clusters: use factoextra::fviz_nbclust()
library("factoextra")
my_data <- scale(USArrests)        # standardized demo data
fviz_nbclust(my_data, kmeans, method = "gap_stat")
Suggested number of clusters: 3

Compute and visualize k-means clustering
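One way to do this (a sketch, reusing my_data from above):

set.seed(123)
km.res <- kmeans(my_data, 3, nstart = 25)     # k-means with k = 3
fviz_cluster(km.res, data = my_data,
             ellipse.type = "convex",
             ggtheme = theme_minimal())       # plot the clusters on the first two principal components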

Similarly, k-medoids/PAM clustering can be computed as follows:
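A corresponding sketch for PAM:

library(cluster)
pam.res <- pam(my_data, 3)      # k-medoids (PAM) with k = 3
fviz_cluster(pam.res)           # visualize the PAM clustering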

K Means clustering

Example from https://www.r-bloggers.com/k-means-clustering-in-r/. We implement K-means clustering using kmeans() (stats package). According to the kmeans() help page, K-means clustering "aims to partition the points into k groups such that the sum of squares from points to the assigned cluster centres is minimized." The argument centers gives the number of cluster centres, k (chosen randomly), and nstart gives the number of random starting sets used to select the k cluster centres. We use just two traits (Petal Length and Petal Width) for easier visualisation; a sketch follows.
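A sketch of that example, using the built-in iris data:

set.seed(20)
irisCluster <- kmeans(iris[, c("Petal.Length", "Petal.Width")],
                      centers = 3, nstart = 20)
table(irisCluster$cluster, iris$Species)       # compare clusters with the known species
plot(iris$Petal.Length, iris$Petal.Width,
     col = irisCluster$cluster,
     xlab = "Petal Length", ylab = "Petal Width")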
Recursive partitioning: example from https://www.r-bloggers.com/trees-with-the-rpart-package/. To check the results of cross-validation or to get more information about the different nodes, see the sketch below.
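A sketch of that example, using the kyphosis data shipped with the rpart package:

library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
printcp(fit)         # cross-validation results
summary(fit)         # detailed information about the different nodes
plot(fit); text(fit) # draw and label the tree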

Hierarchical clustering is an alternative to partitioning clustering for identifying groups in a dataset. It does not require pre-specifying the number of clusters to be generated. In R, hierarchical methods like agnes, diana, and mona construct a hierarchy of clusterings, with the number of clusters ranging from one to the number of observations. The result of hierarchical clustering is a tree-based representation of the objects known as a dendrogram. Observations can be subdivided into groups by cutting the dendrogram at a desired similarity level.

R code to compute and visualize hierarchical clustering:

 # Enhanced hierarchical clustering
 res.hc <- eclust(df, "hclust")    # compute hclust
 fviz_dend(res.hc, rect = TRUE)    # visualize the dendrogram
Hierarchical clustering [or hierarchical cluster analysis (HCA)] is an alternative approach to partitioning clustering for grouping objects based on their similarity. In contrast to partitioning clustering, hierarchical clustering does not require pre-specifying the number of clusters to be produced.

Hierarchical clustering can be subdivided into two types:
  • Agglomerative clustering, in which each observation is initially considered as a cluster of its own (leaf). Then the most similar clusters are successively merged until there is just one single big cluster (root).
  • Divisive clustering, an inverse of agglomerative clustering, begins with the root, in which all objects are included in one cluster. Then the most heterogeneous clusters are successively divided until all observations are in their own clusters.

In agglomerative hierarchical cluster analysis, sample plots all start out as individuals, and the two plots that are most similar (or least dissimilar) are fused to form the first cluster. Subsequently, plots are continually fused one by one, in order of highest similarity (or, equivalently, lowest dissimilarity), to the plot or cluster to which they are most similar. The hierarchy is determined by the height of each cluster, characterized by the similarity at which the plots fused to form the cluster. Eventually, all plots are contained in the final cluster at similarity 0.0.

The agglomerative clustering is the most common type of hierarchical clustering used to group objects in clusters based on their similarity. It’s also known as AGNES (Agglomerative Nesting). The algorithm starts by treating each object as a singleton cluster. Next, pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all objects. The result is a tree-based representation of the objects, named dendrogram.

The dendrogram is a multilevel hierarchy where clusters at one level are joined together to form the clusters at the next level. This makes it possible to decide the level at which to cut the tree in order to generate suitable groups of data objects.

Algorithm

Cluster algorithms are classified by two characteristics: (1) hierarchical vs fixed-cluster, and (2) if hierarchical, agglomerative vs divisive.

Agglomerative cluster algorithms differ in the calculation of similarity when more than one plot is involved; i.e. when a plot is considered for merger with a cluster containing more than one plot. Of all the algorithms invented, we will limit our consideration to those available in R, which are also those most commonly used. In R there are multiple methods:

  • single --- nearest neighbor
  • complete --- furthest neighbor or compact
  • ward --- Ward's minimum variance method
  • mcquitty --- McQuitty's method
  • average --- average similarity
  • median --- median (as opposed to average) similarity
  • centroid --- geometric centroid
  • flexible --- flexible Beta
In the "connected" or "single" linkage, the similarity of a plot to a cluster is determined by the maximum similarity of the plot to any of the plots in the cluster. The name "single linkage" comes because the plot only needs to be similar to a single member of the cluster to join. Single linkage clusters can be long and branched in high-dimensional space, and are subject to a phenomenon called "chaining", where a single plot is continually added to the tail of the biggest cluster. On the other hand, single linkage clusters are topologically "clean."

In the average linkage, the similarity of a plot to a cluster is defined by the mean similarity of the plot to all the members of the cluster. In contrast to single linkage, a plot needs to be relatively similar to all the members of the cluster to join, rather than just one. Average linkage clusters tend to be relatively round or ellipsoid. Median or centroid approaches tend to give similar results.

In the "complete" linkage, or "compact" algorithm, the similarity of a plot to a cluster is calculated as the minimum similarity of the plot to any member of the cluster. Similar to the single linkage algorithm, the probability of a plot joining a cluster is determined by a single other member of a cluster, but now it is the least similar, not the most similar. Complete linkage clusters tend to be very tight and spherical, thus the alternative name "compact." The inverse of agglomerative clustering is divisive clustering, which is also known as DIANA (Divise Analysis) and it works in a “top-down” manner. It begins with the root, in which all objects are included in a single cluster. At each step of iteration, the most heterogeneous cluster is divided into two. The process is iterated until all objects are in their own cluster (see figure below). Note that, agglomerative clustering is good at identifying small clusters. Divisive clustering is good at identifying large clusters. In this article, we’ll focus mainly on agglomerative hierarchical clustering.

In R, we typically use the hclust() function to perform hierarchical cluster analysis. hclust() will calculate a cluster analysis from either a similarity or dissimilarity matrix, but plots better when working from a dissimilarity matrix. We can use any dissimilarity object from dist() or vegdist().

The hclust() function takes the dissimilarity object as its first argument and the agglomeration method as its second explicit argument. For instance:
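Here is a sketch, reusing the Bray-Curtis dissimilarity dis.bray computed earlier (which itself assumed a vegetation matrix veg):

hcl.ward <- hclust(dis.bray, method = "ward.D2")   # Ward's minimum variance clustering
plot(hcl.ward)                                     # plot the resulting dendrogram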

Data structure and preparation

The data should be a numeric matrix with:
  • rows representing observations (individuals);
  • and columns representing variables.
Here, we’ll use the R base USArrests data set. Note that it’s generally recommended to standardize variables in the data set before performing subsequent analysis. Standardization makes variables comparable when they are measured on different scales. For example, one variable can measure height in meters and another can measure weight in kilograms. The R function scale() can be used for standardization; see the ?scale documentation.
In order to decide which objects/clusters should be combined or divided, we need methods for measuring the similarity between objects.

There are many methods for calculating the (dis)similarity information, including Euclidean and Manhattan distances. In R, you can use the function dist() to compute the distance between every pair of objects in a data set. The result of this computation is known as a distance or dissimilarity matrix.

By default, the function dist() computes the Euclidean distance between objects; however, it’s possible to indicate other metrics using the argument method. See ?dist for more information.

For example, for the R base data set USArrests, you can compute the distance matrix as follows:
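A sketch of that computation, standardizing first as recommended above:

df <- scale(USArrests)                        # standardize the variables
res.dist <- dist(df, method = "euclidean")    # Euclidean distance matrix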

Note that the function dist() computes the distance between the rows of a data matrix using the specified distance measure method.

To see the distance information between objects more easily, we reformat the results of the function dist() into a matrix using the as.matrix() function. In this matrix, the value in the cell formed by row i and column j represents the distance between object i and object j in the original data set. For instance, element 1,1 represents the distance between object 1 and itself (which is zero). Element 1,2 represents the distance between object 1 and object 2, and so on.

The R code below displays the first 6 rows and columns of the distance matrix:
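One way to do this, reusing res.dist from above:

round(as.matrix(res.dist)[1:6, 1:6], 1)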

Linkage

The linkage function takes the distance information, returned by the function dist(), and groups pairs of objects into clusters based on their similarity. Next, these newly formed clusters are linked to each other to create bigger clusters. This process is iterated until all the objects in the original data set are linked together in a hierarchical tree.

For example, given a distance matrix “res.dist” generated by the function dist(), the R base function hclust() can be used to create the hierarchical tree. hclust() can be used as follows:
res.hc <- hclust(d = res.dist, method = "ward.D2")
  • d: a dissimilarity structure as produced by the dist() function.
  • method: the agglomeration (linkage) method to be used for computing the distance between clusters. Allowed values are “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median”, and “centroid”.
There are many cluster agglomeration methods (i.e, linkage methods). The most common linkage methods are described below.
  • Maximum or complete linkage: The distance between two clusters is defined as the maximum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce more compact clusters.
  • Minimum or single linkage: The distance between two clusters is defined as the minimum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce long, “loose” clusters.
  • Mean or average linkage: The distance between two clusters is defined as the average distance between the elements in cluster 1 and the elements in cluster 2.
  • Centroid linkage: The distance between two clusters is defined as the distance between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.
  • Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step, the pair of clusters with the minimum between-cluster distance is merged.
  • McQuitty method: When considering merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as \[ \frac{1}{2} \left[ (\mbox{dissimilarity between A and C}) + (\mbox{dissimilarity between B and C}) \right] . \] At each step, this method then merges the pair of clusters that results in the minimal increase in total dissimilarity.
Note that, at each stage of the clustering process, the two clusters that have the smallest linkage distance are linked together. Complete linkage and Ward’s method are generally preferred.
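As a sketch, the tree built with res.hc above can be plotted and cut into four groups as follows:

plot(res.hc, cex = 0.6, hang = -1)         # plot the dendrogram
grp <- cutree(res.hc, k = 4)               # cut the tree into 4 clusters
table(grp)                                 # number of observations in each cluster
rect.hclust(res.hc, k = 4, border = 2:5)   # draw rectangles around the 4 clusters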

Examples: Clustering in R

We start with the iris data to perform a few different types of clustering analysis. You can also use other datasets that are included in R. We are using Fisher's iris data set, which gives measurements (in cm) of 4 variables for 150 flowers across 3 species of iris.

Hierarchical Cluster Analysis

The dendextend cluster analysis vignette can be found on CRAN, the official R package repository.

First, create a distance matrix representing distances between individual flower observations using the dist() function. The default distance measure is Euclidean; check the arguments if you want to use something else.
?dist
iris2 <- iris[, -5]    # numeric measurements only (the Species column is dropped)
d <- dist(iris2)
Next, perform hierarchical cluster analysis using the hclust() function (base R). According to the help page for hclust: "Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster." A number of different clustering methods are available; the default is complete linkage. The output is an object that describes the tree produced by this process.
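A sketch of this step, reusing the distance object d from above:

hc <- hclust(d)                              # complete linkage by default
plot(hc, labels = iris$Species, cex = 0.5)   # dendrogram labeled by species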