Cluster Analysis and K-Nearest Neighbours

Abstract

Cluster analysis is an exploratory analysis that identifies possible meaningful groups in the data. It is widely used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics and data compression [i]. The k-nearest neighbours algorithm is a type of lazy learning, where the function is only approximated locally and all computation is deferred until classification [ii]. The purpose of this report is to analyze a dataset by applying the K-means clustering and k-nearest neighbours algorithms. The goal of this assignment is to understand these two machine learning algorithms, the differences between them, and how to perform them using R.


Data Understanding

The data set that I am using is from the SeaFlow environmental flow cytometry instrument, developed by the Armbrust Lab at the University of Washington. A flow cytometer delivers a flow of particles through a capillary. By shining lasers of different wavelengths and measuring the absorption and refraction patterns, the sizes of the particles and some information about their colour and other properties can be determined.

The technology was developed for medical applications, where the particles were potential pathogens.

The dataset represents a 21-minute sample from the vessel, in a file seaflow_21min.csv. It is a large dataset, so I will analyze the first one thousand records for the purpose of this exercise. The attributes of this dataset are as follows:

  • file_id: The data arrives in files, where each file represents a three-minute window; this field represents which file the data came from.
  • time: The time the particle passed through the instrument.
  • cell_id: A unique identifier for each cell within a file.
  • d1, d2: Intensity of light at the two main sensors, oriented perpendicularly. Used primarily in preprocessing; they are unlikely to be useful for classification.
  • fsc_small, fsc_perp, fsc_big: Forward scatter small, perpendicular, and big. These values help distinguish different sizes of particles.
  • pe: A measurement of phycoerythrin fluorescence, which is related to the wavelength associated with an orange colour in microorganisms.
  • chl_small, chl_big: Measurements related to the wavelength of light corresponding to chlorophyll.
  • pop: This is the class label. Some particles cannot be unambiguously classified, so 100% accuracy should not be expected. The values of this attribute are nano, pico, synecho and ultra.

# read and understand the data
seaflow = read.csv("seaflow_21min_1000.csv", header=TRUE)
str(seaflow)

## 'data.frame':    1000 obs. of  12 variables:
##  $ file_id  : int  203 203 203 203 203 203 203 203 203 203 ...
##  $ time     : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ cell_id  : int  1 4 6 9 11 15 17 23 24 26 ...
##  $ d1       : int  25344 12960 21424 7712 30368 30032 28704 17008 4896 313
##  $ d2       : int  27968 22144 23008 14528 21440 22704 21664 7072 13104 22
##  $ fsc_small: int  34677 37275 31725 28744 28861 31221 37539 38987 25515
##  $ fsc_perp : int  14944 20440 11253 10219 6101 13488 17944 20315 5995 218
##  $ fsc_big  : int  32400 32400 32384 32416 32400 32400 32400 32416 32400
##  $ pe       : int  2216 1795 1901 1248 12989 1883 2107 1509 1952 2525 ...
##  $ chl_small: int  28237 36755 26640 35392 23421 27323 34627 36680 29621
##  $ chl_big  : int  5072 14224 0 10704 5920 6560 11072 15072 3040 2336 ...
##  $ pop      : Factor w/ 4 levels "nano","pico",..: 2 4 2 4 3 2 4 4 2 2 ...

summary(seaflow)

##     file_id         time          cell_id             d1
##  Min.   :203   Min.   :12.00   Min.   :   1.0   Min.   : 2976
##  1st Qu.:203   1st Qu.:14.00   1st Qu.: 704.8   1st Qu.: 7756
##  Median :203   Median :16.00   Median :1338.0   Median :17696
##  Mean   :203   Mean   :16.32   Mean   :1342.0   Mean   :17241
##  3rd Qu.:203   3rd Qu.:18.00   3rd Qu.:1989.2   3rd Qu.:24292
##  Max.   :203   Max.   :20.00   Max.   :2635.0   Max.   :47824
##        d2          fsc_small        fsc_perp        fsc_big
##  Min.   :  112   Min.   :10109   Min.   :    0   Min.   :32384
##  1st Qu.: 9420   1st Qu.:30978   1st Qu.:12915   1st Qu.:32400
##  Median :18472   Median :35206   Median :17610   Median :32400
##  Mean   :17192   Mean   :34638   Mean   :17124   Mean   :32405
##  3rd Qu.:24036   3rd Qu.:39066   3rd Qu.:21996   3rd Qu.:32416
##  Max.   :53056   Max.   :63640   Max.   :63456   Max.   :32432
##        pe          chl_small        chl_big          pop
##  Min.   :    0   Min.   : 4648   Min.   :    0   nano   :171
##  1st Qu.: 1578   1st Qu.:24058   1st Qu.: 3704   pico   :340
##  Median : 2256   Median :31192   Median : 8120   synecho:192
##  Mean   : 4458   Mean   :30682   Mean   : 8514   ultra  :297
##  3rd Qu.: 3834   3rd Qu.:38251   3rd Qu.:12596
##  Max.   :29317   Max.   :62541   Max.   :44688

# check if there is any missing values
sum(is.na(seaflow))

## [1] 0

The preliminary examination of the dataset with R provides the following information:

  • The dataset consists of 1000 observations and 12 variables.
  • All variables are numerical, except the class label ‘pop’, which is categorical.
  • There are no missing values in the dataset.

Constructing scatterplots with the ggvis package.

library(ggvis)

# Scatter plots to get initial overview of the data Set
seaflow %>% ggvis(~fsc_small, ~fsc_perp, fill = ~pop) %>% layer_points()

Figure: fsc_small vs fsc_perp

seaflow %>% ggvis(~chl_small, ~chl_big, fill = ~pop) %>% layer_points()

Figure: chl_small vs chl_big

seaflow %>% ggvis(~chl_small, ~pe, fill = ~pop) %>% layer_points()

Figure: chl_small vs pe

seaflow %>% ggvis(~chl_big, ~pe, fill = ~pop) %>% layer_points()

Figure: chl_big vs pe

K-Means Clustering

K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells [iii].
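To make the two alternating steps concrete, here is a minimal sketch of Lloyd's algorithm on toy data. It is illustrative only (it does not handle empty clusters, for instance); the analysis that follows uses R's built-in kmeans() function.

# illustrative sketch of K-means (Lloyd's algorithm) on toy data
set.seed(1)
toy <- matrix(rnorm(40), ncol = 2)       # 20 points with 2 features
k <- 2
centers <- toy[sample(nrow(toy), k), ]   # random initial centroids
for (iter in 1:10) {
  # assignment step: attach each point to its nearest centroid
  d <- as.matrix(dist(rbind(centers, toy)))[-(1:k), 1:k]
  cluster <- apply(d, 1, which.min)
  # update step: move each centroid to the mean of its assigned points
  centers <- t(sapply(1:k, function(j)
    colMeans(toy[cluster == j, , drop = FALSE])))
}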

The following is the K-means cluster analysis of the dataset.

library("NbClust")

# a function to plot the total within-groups sums of squares
# against the number of clusters in a K-means
wssplot <- function(data, nc=15, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data, 2, var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}

# remove column ‘pop’ as it is the class label
df <- seaflow[-12]

# remove columns file_id, time, cell_id, d1, d2, as they are not relevant
df <- df[-1:-5]
head(df)

##   fsc_small fsc_perp fsc_big    pe chl_small chl_big
## 1     34677    14944   32400  2216     28237    5072
## 2     37275    20440   32400  1795     36755   14224
## 3     31725    11253   32384  1901     26640       0
## 4     28744    10219   32416  1248     35392   10704
## 5     28861     6101   32400 12989     23421    5920
## 6     31221    13488   32400  1883     27323    6560

# standardize/normalize data
df <- scale(df)
head(df)

##        fsc_small   fsc_perp    fsc_big         pe  chl_small    chl_big
## [1,]  0.005217565 -0.2886005 -0.5177936 -0.4120231 -0.2481986 -0.5633575
## [2,]  0.356434948  0.4388824 -0.5177936 -0.4893935  0.6165285  0.9345090
## [3,] -0.393856227 -0.7771630 -2.3156879 -0.4699130 -0.4103222 -1.3934689
## [4,] -0.796850461 -0.9140294  1.2801007 -0.5899198  0.4781600  0.3584065
## [5,] -0.781033512 -1.4591121 -0.5177936  1.5678136 -0.7371074 -0.4245692
## [6,] -0.461990777 -0.4813253 -0.5177936 -0.4732210 -0.3409857 -0.3198233

Data normalization is one of the preprocessing procedures in machine learning. Since the variables vary in range, they are scaled so as to fall within a small specified range, preventing one variable from overpowering another [iv].
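For reference, scale() standardizes each column to zero mean and unit variance; a minimal sketch of the equivalent computation for a single column:

# what scale() computes for one column: subtract the mean, divide by the sd
z_fsc_small <- (seaflow$fsc_small - mean(seaflow$fsc_small)) / sd(seaflow$fsc_small)
head(z_fsc_small)  # should match the fsc_small column of the scaled df above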

# determine number of clusters
# A bend in the graph can suggest the appropriate number of clusters
wssplot(df)

Figure: within-groups sum of squares vs number of clusters

The above plot indicates a distinct drop in the within-groups sum of squares from 1 to 2 clusters. After 2 clusters the decrease levels off, suggesting that a 2-cluster solution may be a good fit for the data.

# set.seed() function to guarantee that the results are reproducible
set.seed(1234)

# create the clusters
nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")

Figure: Hubert index plots

## *** : The Hubert index is a graphical method of determining the number of
## clusters. In the plot of Hubert index, we seek a significant knee that
## corresponds to a significant increase of the value of the measure i.e the
## significant peak in Hubert index second differences plot.

Figure: D index plots

## *** : The D index is a graphical method of determining the number of
## clusters. In the plot of D index, we seek a significant knee (the
## significant peak in Dindex second differences plot) that corresponds to a
## significant increase of the value of the measure.
## *****************************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 2 proposed 15 as the best number of clusters
##
##                    ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is  2
##
## *******************************************************************

table(nc$Best.n[1,])

##  0  1  2  3  4  5  6  7  9 13 15
##  2  1  6  4  4  1  2  2  1  1  2

# create a bar plot
barplot(table(nc$Best.n[1,]),
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by 26 Criteria")

Figure: number of clusters chosen by 26 criteria

The above bar chart shows that the NbClust package suggests a 2-cluster solution.

set.seed(1234)

# k-means cluster analysis
fit.km2 <- kmeans(df, 2, nstart=25)
fit.km2$size

## [1] 484 516

fit.km2$centers

##    fsc_small   fsc_perp    fsc_big         pe  chl_small    chl_big
## 1  0.6702573  0.6486115  0.1285569 -0.3841432  0.7726003  0.7313041
## 2 -0.6286910 -0.6083876 -0.1205844  0.3603203 -0.7246871 -0.6859519

# remove columns pop, file_id, time, cell_id, d1, d2
new_seaflow <- seaflow[-12]
new_seaflow <- new_seaflow[-1:-5]

# aggregate to determine variable means for each cluster in original metric
aggregate(new_seaflow, by=list(cluster=fit.km2$cluster), mean)

##   cluster fsc_small fsc_perp  fsc_big       pe chl_small   chl_big
## 1       1  39596.38 22024.46 32405.75 2367.705  38292.39 12982.413
## 2       2  29987.90 12528.07 32403.53 6418.597  23543.35  4322.946

# A cross-tabulation of pop and cluster membership
ct.km2 <- table(seaflow$pop, fit.km2$cluster)
ct.km2
##               1   2
##   nano      171   0
##   pico       35 305
##   synecho     3 189
##   ultra     275  22

We compare the cluster memberships with the class label ‘pop’ in the cross-tabulation. The two partitions clearly disagree: the 2-cluster solution merges several of the four labelled populations into the same cluster.

library(flexclust)

# The adjusted Rand index measures the agreement between two partitions.
# It ranges from -1 (no agreement) to 1 (perfect agreement).
randIndex(ct.km2)

##       ARI
## 0.3998745

We measure the agreement between the class label ‘pop’ and the cluster membership using the adjusted Rand index provided by the flexclust package. The result of randIndex shows that the agreement between ‘pop’ and the cluster solution is 0.40, which is far from perfect agreement.
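For the curious, the index can also be computed by hand from the contingency table. This is a sketch of the Hubert–Arabie form of the adjusted Rand index, which is what randIndex reports; ari is a hypothetical helper, not part of any package:

# adjusted Rand index from a contingency table (Hubert & Arabie form)
ari <- function(tab) {
  n        <- sum(tab)
  sum_ij   <- sum(choose(tab, 2))            # agreeing pairs within cells
  sum_i    <- sum(choose(rowSums(tab), 2))   # pairs within 'pop' classes
  sum_j    <- sum(choose(colSums(tab), 2))   # pairs within clusters
  expected <- sum_i * sum_j / choose(n, 2)   # chance-level agreement
  (sum_ij - expected) / ((sum_i + sum_j) / 2 - expected)
}
ari(ct.km2)  # should agree with randIndex(ct.km2), i.e. about 0.40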

Let’s try different K values.

# try k=3
fit.km3 <- kmeans(df, 3, nstart=25)
ct.km3 <- table(seaflow$pop, fit.km3$cluster)
ct.km3

##               1   2   3
##   nano      171   0   0
##   pico       30 309   1
##   synecho     1  57 134
##   ultra     257  40   0

randIndex(ct.km3)

##       ARI
## 0.4786957

# try k=4
fit.km4 <- kmeans(df, 4, nstart=25)
ct.km4 <- table(seaflow$pop, fit.km4$cluster)
ct.km4
##               1   2   3   4
##   nano      168   3   0   0
##   pico        0 236   0 104
##   synecho     1   0 117  74
##   ultra     106 191   0   0

randIndex(ct.km4)

##       ARI
## 0.3835063

The randIndex results suggest that the 3-cluster solution (ARI = 0.48) is a better fit than the 2-cluster solution recommended by the NbClust package (ARI = 0.40), while the 4-cluster solution (ARI = 0.38) performs worse than either.


K-Nearest Neighbours

The k-Nearest Neighbours algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbours, with the object being assigned to the class most common among its k-nearest neighbours (k is a positive integer, typically small) [v].
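As a sketch of this idea, classifying a single query point comes down to sorting distances and taking a majority vote. The helper knn_one below is hypothetical and for illustration only; the analysis that follows uses knn() from the class package.

# minimal k-NN classification of one query point (illustrative only)
knn_one <- function(train, labels, query, k = 3) {
  d  <- sqrt(rowSums(sweep(train, 2, query)^2))  # Euclidean distances
  nn <- order(d)[1:k]                            # indices of k closest points
  names(which.max(table(labels[nn])))            # majority vote among them
}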

The following is the k-NN analysis of the dataset.

# read the data
seaflow_data = read.csv("seaflow_21min_1000.csv", header=TRUE)

# check the pop attribute to see the division
table(seaflow_data$pop)

##
##    nano    pico synecho   ultra
##     171     340     192     297

# KNN algorithm
library(class)

# normalization
normalize <- function(x) {
  num <- x - min(x)
  denom <- max(x) - min(x)
  return (num/denom)
}

seaflow_norm <- as.data.frame(lapply(seaflow_data[6:11], normalize))

summary(seaflow_norm)

##    fsc_small         fsc_perp         fsc_big             pe
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000
##  1st Qu.:0.3899   1st Qu.:0.2035   1st Qu.:0.3333   1st Qu.:0.05383
##  Median :0.4688   Median :0.2775   Median :0.3333   Median :0.07695
##  Mean   :0.4582   Mean   :0.2699   Mean   :0.4293   Mean   :0.15206
##  3rd Qu.:0.5409   3rd Qu.:0.3466   3rd Qu.:0.6667   3rd Qu.:0.13078
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
##    chl_small         chl_big
##  Min.   :0.0000   Min.   :0.00000
##  1st Qu.:0.3353   1st Qu.:0.08289
##  Median :0.4585   Median :0.18170
##  Mean   :0.4497   Mean   :0.19052
##  3rd Qu.:0.5804   3rd Qu.:0.28187
##  Max.   :1.0000   Max.   :1.00000

# create training and test sets
set.seed(1234)

# randomly assign each row to the training set (~67%) or the test set (~33%)
ind <- sample(2, nrow(seaflow_data), replace=TRUE, prob=c(0.67, 0.33))

# define training and test sets
seaflow.training <- seaflow_data[ind==1, 6:11]
seaflow.test <- seaflow_data[ind==2, 6:11]

# define the target variable
seaflow.trainLabels <- seaflow_data[ind==1, 12]
seaflow.testLabels <- seaflow_data[ind==2, 12]
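Note that the split above draws the predictor columns from the raw data frame, and the results reported below were produced that way. To train on the min-max normalized features instead, the same row indices could be applied to seaflow_norm:

# alternative (not used for the results below): normalized features
seaflow.training.norm <- seaflow_norm[ind==1, ]
seaflow.test.norm     <- seaflow_norm[ind==2, ]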

# Build the KNN Model
# Try k=3
seaflow_pred_k3 <- knn(train = seaflow.training, test = seaflow.test, cl = seaflow.trainLabels, k=3)

# evaluate the model performance
library(gmodels)

# cross tabulation or a contingency table
CrossTable(x = seaflow.testLabels, y = seaflow_pred_k3, prop.chisq=FALSE)

##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  326
##
##                    | seaflow_pred_k3
## seaflow.testLabels |      nano |      pico |   synecho |     ultra | Row Total |
## -------------------|-----------|-----------|-----------|-----------|-----------|
##               nano |        47 |         0 |         0 |         2 |        49 |
##                    |     0.959 |     0.000 |     0.000 |     0.041 |     0.150 |
##                    |     1.000 |     0.000 |     0.000 |     0.022 |           |
##                    |     0.144 |     0.000 |     0.000 |     0.006 |           |
## -------------------|-----------|-----------|-----------|-----------|-----------|
##               pico |         0 |       118 |         0 |         2 |       120 |
##                    |     0.000 |     0.983 |     0.000 |     0.017 |     0.368 |
##                    |     0.000 |     0.944 |     0.000 |     0.022 |           |
##                    |     0.000 |     0.362 |     0.000 |     0.006 |           |
## -------------------|-----------|-----------|-----------|-----------|-----------|
##            synecho |         0 |         2 |        62 |         0 |        64 |
##                    |     0.000 |     0.031 |     0.969 |     0.000 |     0.196 |
##                    |     0.000 |     0.016 |     1.000 |     0.000 |           |
##                    |     0.000 |     0.006 |     0.190 |     0.000 |           |
## -------------------|-----------|-----------|-----------|-----------|-----------|
##              ultra |         0 |         5 |         0 |        88 |        93 |
##                    |     0.000 |     0.054 |     0.000 |     0.946 |     0.285 |
##                    |     0.000 |     0.040 |     0.000 |     0.957 |           |
##                    |     0.000 |     0.015 |     0.000 |     0.270 |           |
## -------------------|-----------|-----------|-----------|-----------|-----------|
##       Column Total |        47 |       125 |        62 |        92 |       326 |
##                    |     0.144 |     0.383 |     0.190 |     0.282 |           |
## -------------------|-----------|-----------|-----------|-----------|-----------|

ct.k3 <- table(seaflow.testLabels, seaflow_pred_k3)

library(flexclust)

# Adjusted Rand index
randIndex(ct.k3)

##       ARI
## 0.9020429
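The overall accuracy can also be read directly off the confusion matrix:

# correct predictions sit on the diagonal: (47+118+62+88)/326, about 0.966
sum(diag(ct.k3)) / sum(ct.k3)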

# Testing different values of k
# let’s try k=5
seaflow_pred_k5 <- knn(train = seaflow.training, test = seaflow.test, cl = seaflow.trainLabels, k=5)
ct.k5 <- table(seaflow.testLabels, seaflow_pred_k5)
randIndex(ct.k5)

##       ARI
## 0.8825074

# let’s try k=7
seaflow_pred_k7 <- knn(train = seaflow.training, test = seaflow.test, cl = seaflow.trainLabels, k=7)
ct.k7 <- table(seaflow.testLabels, seaflow_pred_k7)
randIndex(ct.k7)

##       ARI
## 0.8745358

# let’s try k=9
seaflow_pred_k9 <- knn(train = seaflow.training, test = seaflow.test, cl = seaflow.trainLabels, k=9)
ct.k9 <- table(seaflow.testLabels, seaflow_pred_k9)
randIndex(ct.k9)

##       ARI
## 0.8540793

# let’s try k=13
seaflow_pred_k13 <- knn(train = seaflow.training, test = seaflow.test, cl = seaflow.trainLabels, k=13)
ct.k13 <- table(seaflow.testLabels, seaflow_pred_k13)
randIndex(ct.k13)

##       ARI
## 0.8760217

# let’s try k=15
seaflow_pred_k15 <- knn(train = seaflow.training, test = seaflow.test, cl = seaflow.trainLabels, k=15)
ct.k15 <- table(seaflow.testLabels, seaflow_pred_k15)
randIndex(ct.k15)

##       ARI
## 0.8616283

The above cross-tabulations and randIndex results suggest that k = 3 gives the best performance of the model's predictions (ARI = 0.90).
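The repeated blocks above can also be condensed into a single loop; a small sketch that performs the same computation:

# sweep the candidate k values in one pass (same computation as above)
for (k in c(3, 5, 7, 9, 13, 15)) {
  pred <- knn(train = seaflow.training, test = seaflow.test,
              cl = seaflow.trainLabels, k = k)
  cat("k =", k, " ARI =", randIndex(table(seaflow.testLabels, pred)), "\n")
}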


Conclusion

K-Means and KNN are two commonly used machine learning algorithms that use distance as a metric. Both require the user to choose a value of K when building the model; however, the two K values mean completely different things. In K-Means, K is the number of clusters the data is partitioned into, while in KNN, K is the number of nearest neighbours used in the vote.

K-Means is an unsupervised clustering algorithm: it learns from unlabeled data and tries to partition a set of points into K clusters such that the points in each cluster are close to one another. KNN is a supervised classification algorithm: it learns from labeled data and classifies a point based on the known classes of its K nearest points.


References

[i] https://en.wikipedia.org/wiki/Cluster_analysis

[ii] https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

[iii] https://en.wikipedia.org/wiki/K-means_clustering

[iv] http://www.medwelljournals.com/fulltext/?doi=ijscomp.2009.168.172

[v] https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
