Data Analytics in Healthcare

I have been reading about the use of Data Analytics in healthcare services as part of my preparation for future job possibilities. I am sharing my findings here as they may be useful and insightful.


Infographic by IBM: Healthcare is challenged by large amounts of data in motion that are diverse, unstructured and growing exponentially. Data constantly streams in through interconnected sensors, monitors and instruments in real time, faster than a physician or nurse can keep up with [i].

 

The increasing digitization of healthcare data means that organizations often add terabytes’ worth of patient records to data centers annually. At the moment, much of that unstructured data sits unused, having been retained largely (if not solely) for regulatory purposes. Making use of it is not easy, not least because the demand for healthcare IT skills far outpaces the supply of workers able to fill job openings. But a better grasp of that data means knowing more about individual patients as well as large groups of them, and knowing how to use that information to provide better, more efficient and less expensive care.

The following are the benefits of implementing data analytics in healthcare services:

Reduce administrative costs

The challenge of reducing administrative costs is one of the biggest for any industry, and about a quarter of healthcare budget expenses goes toward administration. Data analytics allows hospital systems to better utilize and exchange the information they already have by making sure their medical codes are properly used, so that the correct reimbursements are received. Electronic medical records and automated coding transactions help hospitals better manage cash flow, thereby decreasing administrative costs.

As a tool for clinical decision support

With the use of data analytics, patient information such as medical tests, lab reports and prescribed medications can be displayed on a single electronic dashboard. This provides a dual benefit: saving time and improving the quality of healthcare services, while at the same time cutting costs for the organization. Issues such as double-prescribing medications due to a lack of information can be greatly reduced.

Reducing fraud and abuse

A significant amount of money is lost in the healthcare industry due to fraud and abuse. Analytics can help organizations track fraudulent and incorrect payments by looking at the history of an individual patient.

Better care coordination

Hospital systems can be very disjointed due to their size and spread. Sharing information between healthcare organizations, and between different specialties within the same system, has always been a challenge. Analytics will hopefully improve upon this as it becomes more widely used.

Improved Information sharing with patients

Analytics can help healthcare organizations track patients and remind them to keep a healthy lifestyle. Information such as ways a patient can modify his or her lifestyle to suit their medical condition can make a real difference to the patient’s health.

 

These are some key benefits a healthcare organization can gain from Data Analytics. The list is not definitive; I am sure there are more benefits to be obtained.

 

 

References

[i] http://www-935.ibm.com/services/us/gbs/thoughtleadership/healthcare-ecosystem/

http://searchhealthit.techtarget.com/report/Analytics-Moving-health-care-forward

http://www.healthcareitnews.com/news/5-ways-hospitals-can-use-data-analytics

http://hissjournal.biomedcentral.com/articles/10.1186/2047-2501-2-3

 

R – ggplot2

Recent news reports describe a remarkable turnaround in the Irish housing market, with house prices in Dublin rising at a rate of over 20% a year. I decided to create a line chart to see how house prices have moved over the past 15 years. My first step was going to the Environment, Community and Local Government website to retrieve the relevant data.

I downloaded the document, extracted and cleansed the data I wanted, and saved it as a CSV file named “houseprice.csv”. I opened the CSV file in R with the syntax:

houseprice <- read.csv("houseprice.csv", header=TRUE, skip=2)

I skipped the first two lines of the file as they contained a title and an empty row.

I installed the ggplot2 package with the syntax: install.packages('ggplot2', dep=TRUE) and then loaded it with the syntax: library(ggplot2).

Now I started plotting my line chart. The syntax to produce the basic line graph with ggplot2, and the resulting plot, are shown below:

ggplot(data=houseprice, aes(x=year, y=Prices, group=Category)) + geom_line()

 

[Plot 1: basic line chart of house prices]

I then changed the line colour and thickness, and added a chart title and x- and y-axis labels. This was the syntax:

ggplot(data=houseprice, aes(x=year, y=Prices, group=Category, colour=Category)) + geom_line(size=1.5) +
ggtitle("Average House & Apartment Prices") +
labs(x="Year", y="Prices in Euro")

[Plot 2: line chart with title, axis labels and coloured lines]

I continued to format the chart by changing the scales of the x-axis and y-axis. I wanted the x-axis to display every single year instead of every five years, and changed the y-axis so that its unit is 20,000 instead of 50,000. I also set different font sizes for the plot title, legend labels, axis titles and axis labels, bolded the plot title and axis titles, and changed the colour of the axis text to black to make it more visible. I removed the legend title, as it was not required, and moved the legend inside the graph.

This was the syntax:

ggplot(data=houseprice, aes(x=year, y=Prices, group=Category, colour=Category)) + geom_line(size=1.5) +
ggtitle("Average House & Apartment Prices") +
labs(x="Year", y="Prices in Euro") +
scale_x_continuous(breaks=seq(2000,2014,1)) +
scale_y_continuous(breaks=seq(140000,420000,20000)) +
theme(plot.title = element_text(face="bold", size=11)) +
theme(axis.title = element_text(face="bold", size=9)) +
theme(axis.text.x = element_text(colour="black", size=7)) +
theme(axis.text.y = element_text(colour="black", size=7)) +
theme(legend.title = element_blank()) +
theme(legend.text = element_text(size=8)) +
theme(legend.position=c(.9, .85))

This was the final result of my line graph.

[Plot 3: final formatted line chart]

From the line graph, we can quickly see that Dublin has the highest house and apartment prices compared to Cork and the country as a whole. House prices in Ireland were at their peak in 2007.

In general, house prices grew steadily from 2000 to 2007. They then fell dramatically within two years, from 2007 to 2009, and continued to drop until 2012. From 2012 onwards, house prices showed signs of recovery across the whole nation.

Dublin house prices began to recover in 2010, two years earlier than the nation as a whole. However, that growth was not sustained and prices dropped again after a year. From 2012 onward, Dublin house prices increased steadily thanks to the ongoing recovery.

It would be interesting to compare the average house and apartment prices with the performance of the domestic economy in Ireland from 2000 to 2014. We could study whether there is any correlation between the economy and house prices, and what the relationship between the two variables looks like. By building a linear regression model, we could learn how the economy affects house prices and make predictions.
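
A sketch of what such a model might look like is below. The figures are entirely made up and only stand in for real economic and price data; the point is simply the shape of the workflow.

# Hypothetical illustration only: synthetic yearly GDP and average price
# figures standing in for real data on the Irish economy and housing market.
econ_house <- data.frame(
  year      = 2000:2014,
  gdp       = c(108, 122, 135, 145, 153, 170, 184, 197, 187, 170,
                167, 171, 175, 179, 194),   # made-up GDP index values
  avg_price = c(170, 183, 198, 224, 249, 276, 306, 322, 305, 243,
                221, 198, 186, 192, 210)    # made-up prices in EUR thousands
)

# fit a simple linear regression of price on GDP
fit <- lm(avg_price ~ gdp, data = econ_house)
summary(fit)                                   # strength of the relationship

# predict the average price at a hypothetical GDP level
predict(fit, newdata = data.frame(gdp = 200))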

Besides that, we could look at the number of completed new houses versus the average house prices in Ireland over the same period. Using visualization, we could easily study how the supply and demand of houses affect the movement of house prices, analyze developments in the housing market, and identify future investment opportunities in the property market.

We could also examine raw building material prices and labour costs, check whether there is any relationship between these elements and house prices in Ireland, and investigate how they affect house prices and whether they have a strong positive correlation with them.

Cluster Analysis and K-Nearest Neighbours

Abstract

Cluster analysis is an exploratory analysis that identifies possible meaningful groups in the data. It is widely used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics and data compression [i]. The k-nearest neighbours algorithm is a type of lazy learning, where the function is only approximated locally and all computation is deferred until classification [ii]. The purpose of this report is to analyze a dataset by applying the K-means clustering and k-nearest neighbours algorithms. The goal of this assignment is to understand these two machine learning algorithms, the differences between them, and how to perform them using R.

 

Data Understanding

The dataset that I am using is from the SeaFlow environmental flow cytometry instrument, developed by the Armbrust Lab at the University of Washington. A flow cytometer delivers a flow of particles through a capillary. By shining lasers of different wavelengths and measuring the absorption and refraction patterns, the size of each particle and some information about its colour and other properties can be determined.

The technology was developed for medical applications, where the particles were potential pathogens.

The dataset represents a 21-minute sample from the vessel, in a file named seaflow_21min.csv. It is a large dataset, so I will analyze the first one thousand records for the purpose of this exercise. The attributes of this dataset are as follows:

  • file_id: The data arrives in files, where each file represents a three-minute window; this field represents which file the data came from.
  • time: The time the particle passed through the instrument.
  • cell_id: A unique identifier for each cell within a file.
  • d1, d2: Intensity of light at the two main sensors, oriented perpendicularly. Used primarily in preprocessing; they are unlikely to be useful for classification.
  • fsc_small, fsc_perp, fsc_big: Forward scatter small, perpendicular, and big. These values help distinguish different sizes of particles.
  • pe: A measurement of phycoerythrin fluorescence, which is related to the wavelength associated with an orange colour in microorganisms.
  • chl_small, chl_big: Measurements related to the wavelength of light corresponding to chlorophyll.
  • pop: This is the class label. Some particles cannot be unambiguously classified, so 100% accuracy should not be expected. The values of this attribute are nano, pico, synecho and ultra.

# read and understand the data
seaflow = read.csv("seaflow_21min_1000.csv", header=TRUE)
str(seaflow)

## 'data.frame':    1000 obs. of  12 variables:
##  $ file_id  : int  203 203 203 203 203 203 203 203 203 203 ...
##  $ time     : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ cell_id  : int  1 4 6 9 11 15 17 23 24 26 ...
##  $ d1       : int  25344 12960 21424 7712 30368 30032 28704 17008 4896 313
##  $ d2       : int  27968 22144 23008 14528 21440 22704 21664 7072 13104 22
##  $ fsc_small: int  34677 37275 31725 28744 28861 31221 37539 38987 25515
##  $ fsc_perp : int  14944 20440 11253 10219 6101 13488 17944 20315 5995 218
##  $ fsc_big  : int  32400 32400 32384 32416 32400 32400 32400 32416 32400
##  $ pe       : int  2216 1795 1901 1248 12989 1883 2107 1509 1952 2525 ...
##  $ chl_small: int  28237 36755 26640 35392 23421 27323 34627 36680 29621
##  $ chl_big  : int  5072 14224 0 10704 5920 6560 11072 15072 3040 2336 ...
##  $ pop      : Factor w/ 4 levels "nano","pico",..: 2 4 2 4 3 2 4 4 2 2 ...

summary(seaflow)

##     file_id        time           cell_id             d1       
##  Min.   :203   Min.   :12.00   Min.   :   1.0   Min.   : 2976
##  1st Qu.:203   1st Qu.:14.00   1st Qu.: 704.8   1st Qu.: 7756
##  Median :203   Median :16.00   Median :1338.0   Median :17696
##  Mean   :203   Mean   :16.32   Mean   :1342.0   Mean   :17241
##  3rd Qu.:203   3rd Qu.:18.00   3rd Qu.:1989.2   3rd Qu.:24292
##  Max.   :203   Max.   :20.00   Max.   :2635.0   Max.   :47824
##        d2          fsc_small        fsc_perp        fsc_big     
##  Min.   :  112   Min.   :10109   Min.   :    0   Min.   :32384
##  1st Qu.: 9420   1st Qu.:30978   1st Qu.:12915   1st Qu.:32400
##  Median :18472   Median :35206   Median :17610   Median :32400
##  Mean   :17192   Mean   :34638   Mean   :17124   Mean   :32405
##  3rd Qu.:24036   3rd Qu.:39066   3rd Qu.:21996   3rd Qu.:32416
##  Max.   :53056   Max.   :63640   Max.   :63456   Max.   :32432
##        pe          chl_small        chl_big            pop     
##  Min.   :    0   Min.   : 4648   Min.   :    0   nano   :171
##  1st Qu.: 1578   1st Qu.:24058   1st Qu.: 3704   pico   :340
##  Median : 2256   Median :31192   Median : 8120   synecho:192
##  Mean   : 4458   Mean   :30682   Mean   : 8514   ultra  :297
##  3rd Qu.: 3834   3rd Qu.:38251   3rd Qu.:12596
##  Max.   :29317   Max.   :62541   Max.   :44688

# check if there are any missing values
sum(is.na(seaflow))

## [1] 0

The preliminary examination of the dataset with R provides us with the following information.

  • The dataset consists of 1000 observations and 12 variables.
  • All variables are numerical except the class label ‘pop’, which is categorical.
  • There are no missing values in the dataset.

Constructing scatterplots with the ggvis package.

library(ggvis)

# Scatter plots to get initial overview of the data Set
seaflow %>% ggvis(~fsc_small, ~fsc_perp, fill = ~pop) %>% layer_points()

[Scatterplot: fsc_small vs fsc_perp]

seaflow %>% ggvis(~chl_small, ~chl_big, fill = ~pop) %>% layer_points()

[Scatterplot: chl_small vs chl_big]

seaflow %>% ggvis(~chl_small, ~pe, fill = ~pop) %>% layer_points()

[Scatterplot: chl_small vs pe]

seaflow %>% ggvis(~chl_big, ~pe, fill = ~pop) %>% layer_points()

[Scatterplot: chl_big vs pe]

K-Means Clustering

K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells [iii].

The following is the K-means cluster analysis of the dataset.

library("NbClust")

# a function to plot the total within-groups sums of squares
# against the number of clusters in a K-means
wssplot <- function(data, nc=15, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}

# remove column ‘pop’ as it is the class label
df <- seaflow[-12]

# remove columns file_id, time, cell_id, d1, d2, as they are not relevant
df <- df[-1:-5]
head(df)

##   fsc_small fsc_perp fsc_big    pe chl_small chl_big
## 1     34677    14944   32400  2216     28237    5072
## 2     37275    20440   32400  1795     36755   14224
## 3     31725    11253   32384  1901     26640       0
## 4     28744    10219   32416  1248     35392   10704
## 5     28861     6101   32400 12989     23421    5920
## 6     31221    13488   32400  1883     27323    6560

# standardize/normalize data
df <- scale(df)
head(df)

##         fsc_small   fsc_perp    fsc_big         pe  chl_small    chl_big
## [1,]  0.005217565 -0.2886005 -0.5177936 -0.4120231 -0.2481986 -0.5633575
## [2,]  0.356434948  0.4388824 -0.5177936 -0.4893935  0.6165285  0.9345090
## [3,] -0.393856227 -0.7771630 -2.3156879 -0.4699130 -0.4103222 -1.3934689
## [4,] -0.796850461 -0.9140294  1.2801007 -0.5899198  0.4781600  0.3584065
## [5,] -0.781033512 -1.4591121 -0.5177936  1.5678136 -0.7371074 -0.4245692
## [6,] -0.461990777 -0.4813253 -0.5177936 -0.4732210 -0.3409857 -0.3198233

Data normalization is one of the preprocessing procedures in machine learning. Since the variables vary in range, they are scaled so as to fall within a small specified range, to prevent one variable from overpowering the others [iv].
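
As a quick sanity check (an extra step, not part of the original analysis), standardizing one column by hand reproduces the first column of the scaled matrix shown above:

# scale() subtracts each column's mean and divides by its standard deviation;
# doing the same by hand for fsc_small should match the scaled values above.
head((seaflow$fsc_small - mean(seaflow$fsc_small)) / sd(seaflow$fsc_small))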

# determine number of clusters
# A bend in the graph can suggest the appropriate number of clusters
wssplot(df)

[Plot: within-groups sum of squares vs number of clusters]

The above plot shows a distinct drop in the within-groups sum of squares from 1 to 2 clusters. After 2 clusters the decrease levels off, suggesting that a 2-cluster solution may be a good fit for the data.

# set.seed() function to guarantee that the results are reproducible
set.seed(1234)

# create the clusters
nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")

[Plot: Hubert index]

## *** The Hubert index is a graphical method of determining the number of
## clusters. In the plot of the Hubert index, we seek a significant knee that
## corresponds to a significant increase of the value of the measure, i.e. the
## significant peak in the Hubert index second differences plot.

[Plot: D index]

## *** The D index is a graphical method of determining the number of
## clusters. In the plot of the D index, we seek a significant knee (the
## significant peak in the Dindex second differences plot) that corresponds to
## a significant increase of the value of the measure.
## *****************************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 2 proposed 15 as the best number of clusters
##
##                    ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is  2
##
## *******************************************************************

table(nc$Best.n[1,])

##  0  1  2  3  4  5  6  7  9 13 15
##  2  1  6  4  4  1  2  2  1  1  2

# create a bar plot
barplot(table(nc$Best.n[1,]),
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by 26 Criteria")

[Bar chart: number of clusters chosen by 26 criteria]

The above bar chart shows that the NbClust package suggests a 2-cluster solution.

set.seed(1234)

# k-means cluster analysis
fit.km2 <- kmeans(df, 2, nstart=25)
fit.km2$size

## [1] 484 516

fit.km2$centers

##    fsc_small   fsc_perp    fsc_big         pe  chl_small    chl_big
## 1  0.6702573  0.6486115  0.1285569 -0.3841432  0.7726003  0.7313041
## 2 -0.6286910 -0.6083876 -0.1205844  0.3603203 -0.7246871 -0.6859519

# remove columns pop, file_id, time, cell_id, d1, d2
new_seaflow <- seaflow[-12]
new_seaflow <- new_seaflow[-1:-5]

# aggregate to determine variable means for each cluster in original metric
aggregate(new_seaflow, by=list(cluster=fit.km2$cluster), mean)

##   cluster fsc_small fsc_perp  fsc_big       pe chl_small   chl_big
## 1       1  39596.38 22024.46 32405.75 2367.705  38292.39 12982.413
## 2       2  29987.90 12528.07 32403.53 6418.597  23543.35  4322.946

# A cross-tabulation of pop and cluster membership
ct.km2 <- table(seaflow$pop, fit.km2$cluster)
ct.km2
##               1    2
##   nano      171    0
##   pico       35  305
##   synecho     3  189
##   ultra     275   22

The cross-tabulation compares the cluster memberships with the class label ‘pop’. The two partitions clearly do not agree: the data contain four labelled populations, while the clustering found only two groups.

library(flexclust)

# The adjusted Rand index – measure the agreement between two partitions
# It ranges from -1 (no agreement) to 1 (perfect agreement)
randIndex(ct.km2)

##       ARI
## 0.3998745

We measure the agreement between the class label ‘pop’ and the cluster solution using the adjusted Rand index provided by the flexclust package. The randIndex result shows that the agreement between ‘pop’ and the cluster solution is 0.40, which is far from a perfect agreement.

Let’s try different K values.

# try k=3
fit.km3 <- kmeans(df, 3, nstart=25)
ct.km3 <- table(seaflow$pop, fit.km3$cluster)
ct.km3

##               1    2    3
##   nano      171    0    0
##   pico       30  309    1
##   synecho     1   57  134
##   ultra     257   40    0

randIndex(ct.km3)

##       ARI
## 0.4786957

# try k=4
fit.km4 <- kmeans(df, 4, nstart=25)
ct.km4 <- table(seaflow$pop, fit.km4$cluster)
ct.km4
##               1    2    3    4
##   nano      168    3    0    0
##   pico        0  236    0  104
##   synecho     1    0  117   74
##   ultra     106  191    0    0

randIndex(ct.km4)

##       ARI
## 0.3835063

The randIndex result for the 3-cluster solution suggests that it is a better fit than the 2-cluster solution recommended by the NbClust package, and also better than the 4-cluster solution.
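
A more compact way to scan several values of k and compare their agreement with the ‘pop’ labels is a small loop; this is just a convenience sketch, not part of the original write-up:

# Adjusted Rand index for k = 2..6; each run is seeded for reproducibility.
sapply(2:6, function(k) {
  set.seed(1234)
  fit <- kmeans(df, centers = k, nstart = 25)
  randIndex(table(seaflow$pop, fit$cluster))
})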

 

K-Nearest Neighbours

The k-Nearest Neighbours algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbours, with the object being assigned to the class most common among its k-nearest neighbours (k is a positive integer, typically small) [v].
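
To make the majority-vote idea concrete, here is a minimal from-scratch sketch, for intuition only; the actual analysis below uses the knn() function from the class package:

# Classify a single query point by a majority vote among its k nearest
# training points (Euclidean distance).
knn_predict_one <- function(train_x, train_labels, query, k = 3) {
  # distance from the query to every training point
  d <- sqrt(rowSums(sweep(as.matrix(train_x), 2, query)^2))
  # labels of the k closest training points
  nearest <- train_labels[order(d)[1:k]]
  # the most frequent label among the neighbours wins
  names(which.max(table(nearest)))
}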

The following is the k-NN analysis of the dataset.

# read the data
seaflow_data = read.csv("seaflow_21min_1000.csv", header=TRUE)

# check the pop attribute to see the division
table(seaflow_data$pop)

##
##    nano    pico synecho   ultra
##     171     340     192     297

# KNN algorithm
library(class)

# normalization
normalize <- function(x) {
  num <- x - min(x)
  denom <- max(x) - min(x)
  return (num/denom)
}

seaflow_norm <- as.data.frame(lapply(seaflow_data[6:11], normalize))

summary(seaflow_norm)

##    fsc_small         fsc_perp         fsc_big             pe         
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000
##  1st Qu.:0.3899   1st Qu.:0.2035   1st Qu.:0.3333   1st Qu.:0.05383
##  Median :0.4688   Median :0.2775   Median :0.3333   Median :0.07695
##  Mean   :0.4582   Mean   :0.2699   Mean   :0.4293   Mean   :0.15206
##  3rd Qu.:0.5409   3rd Qu.:0.3466   3rd Qu.:0.6667   3rd Qu.:0.13078
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
##    chl_small         chl_big        
##  Min.   :0.0000   Min.   :0.00000
##  1st Qu.:0.3353   1st Qu.:0.08289
##  Median :0.4585   Median :0.18170
##  Mean   :0.4497   Mean   :0.19052
##  3rd Qu.:0.5804   3rd Qu.:0.28187
##  Max.   :1.0000   Max.   :1.00000

# create training and test sets
set.seed(1234)

# randomly assign each row to the training (~2/3) or test (~1/3) set
ind <- sample(2, nrow(seaflow_data), replace=TRUE, prob=c(0.67, 0.33))

# define training and test sets
seaflow.training <- seaflow_data[ind==1, 6:11]
seaflow.test <- seaflow_data[ind==2, 6:11]

# define the target variable
seaflow.trainLabels <- seaflow_data[ind==1, 12]
seaflow.testLabels <- seaflow_data[ind==2, 12]

# Build the KNN Model
# Try k=3
seaflow_pred_k3 <- knn(train = seaflow.training, test = seaflow.test, cl = seaflow.trainLabels, k=3)

# evaluate the model performance
library(gmodels)

# cross tabulation or a contingency table
CrossTable(x = seaflow.testLabels, y = seaflow_pred_k3, prop.chisq=FALSE)

##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  326
##
##                    | seaflow_pred_k3
## seaflow.testLabels |      nano |      pico |   synecho |     ultra | Row Total |
## -------------------|-----------|-----------|-----------|-----------|-----------|
##               nano |        47 |         0 |         0 |         2 |        49 |
##                    |     0.959 |     0.000 |     0.000 |     0.041 |     0.150 |
##                    |     1.000 |     0.000 |     0.000 |     0.022 |           |
##                    |     0.144 |     0.000 |     0.000 |     0.006 |           |
## -------------------|-----------|-----------|-----------|-----------|-----------|
##               pico |         0 |       118 |         0 |         2 |       120 |
##                    |     0.000 |     0.983 |     0.000 |     0.017 |     0.368 |
##                    |     0.000 |     0.944 |     0.000 |     0.022 |           |
##                    |     0.000 |     0.362 |     0.000 |     0.006 |           |
## -------------------|-----------|-----------|-----------|-----------|-----------|
##            synecho |         0 |         2 |        62 |         0 |        64 |
##                    |     0.000 |     0.031 |     0.969 |     0.000 |     0.196 |
##                    |     0.000 |     0.016 |     1.000 |     0.000 |           |
##                    |     0.000 |     0.006 |     0.190 |     0.000 |           |
## -------------------|-----------|-----------|-----------|-----------|-----------|
##              ultra |         0 |         5 |         0 |        88 |        93 |
##                    |     0.000 |     0.054 |     0.000 |     0.946 |     0.285 |
##                    |     0.000 |     0.040 |     0.000 |     0.957 |           |
##                    |     0.000 |     0.015 |     0.000 |     0.270 |           |
## -------------------|-----------|-----------|-----------|-----------|-----------|
##       Column Total |        47 |       125 |        62 |        92 |       326 |
##                    |     0.144 |     0.383 |     0.190 |     0.282 |           |
## -------------------|-----------|-----------|-----------|-----------|-----------|

ct.k3 <- table(seaflow.testLabels, seaflow_pred_k3)

library(flexclust)

# Adjusted Rand index
randIndex(ct.k3)

##       ARI
## 0.9020429

# Testing different values of k
# let’s try k=5
seaflow_pred_k5 <- knn(train = seaflow.training, test = seaflow.test, cl = seaflow.trainLabels, k=5)
ct.k5 <- table(seaflow.testLabels, seaflow_pred_k5)
randIndex(ct.k5)

##       ARI
## 0.8825074

# let’s try k=7
seaflow_pred_k7 <- knn(train = seaflow.training, test = seaflow.test, cl = seaflow.trainLabels, k=7)
ct.k7 <- table(seaflow.testLabels, seaflow_pred_k7)
randIndex(ct.k7)

##       ARI
## 0.8745358

# let’s try k=9
seaflow_pred_k9 <- knn(train = seaflow.training, test = seaflow.test, cl = seaflow.trainLabels, k=9)
ct.k9 <- table(seaflow.testLabels, seaflow_pred_k9)
randIndex(ct.k9)

##       ARI
## 0.8540793

# let’s try k=13
seaflow_pred_k13 <- knn(train = seaflow.training, test = seaflow.test, cl = seaflow.trainLabels, k=13)
ct.k13 <- table(seaflow.testLabels, seaflow_pred_k13)
randIndex(ct.k13)

##       ARI
## 0.8760217

# let’s try k=15
seaflow_pred_k15 <- knn(train = seaflow.training, test = seaflow.test, cl = seaflow.trainLabels, k=15)
ct.k15 <- table(seaflow.testLabels, seaflow_pred_k15)
randIndex(ct.k15)

##       ARI
## 0.8616283

The above cross table and randIndex results suggest that k = 3 gives the best performance in terms of the model’s predictions.
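
Another simple check, complementing the adjusted Rand index, is the overall accuracy on the held-out test set; this is an extra sketch rather than part of the original evaluation:

# proportion of test particles classified correctly for k = 3
mean(seaflow_pred_k3 == seaflow.testLabels)

# the same check can be repeated for the other k values, e.g. k = 5
mean(seaflow_pred_k5 == seaflow.testLabels)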

 

Conclusion

K-Means and KNN are two commonly used machine learning algorithms that rely on a distance metric. Both require the user to choose a value of K when building the model; however, the two K values refer to completely different things. In K-Means, K is the number of clusters to partition the data into, while in KNN, K is the number of nearest neighbours used in the vote.

K-Means is an unsupervised clustering algorithm, learning from the unlabeled data and trying to partition a set of points into K clusters, such that the points in each cluster are near to each other. KNN is a supervised classification algorithm, learning from the labeled data and trying to classify a point based on the known classification of other points by using the K nearest points.

 

References

[i] https://en.wikipedia.org/wiki/Cluster_analysis

[ii] https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

[iii] https://en.wikipedia.org/wiki/K-means_clustering

[iv] http://www.medwelljournals.com/fulltext/?doi=ijscomp.2009.168.172

[v] https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

AirBnb’s Analysis of Variance

Abstract

Airbnb ran a competition on Kaggle challenging participants to predict where a new user will book their first travel destination. The purpose of this report is to analyze the dataset provided by Airbnb and see whether an analysis of variance (ANOVA) model can anticipate where people are likely to take their first trip on Airbnb. I first did a preliminary examination of the dataset, then cleaned it and handled the missing values. I performed a one-way and a two-way ANOVA. Finally, I ran a goodness-of-fit test and checked whether the data follow a normal distribution.

 

Data Understanding and Preparation

Data Description

I use the data file named train_users_2.csv for this report. It is the training set of users provided by Airbnb and can be downloaded from https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings

train_data <- read.csv("train_users_2.csv", header = TRUE)
str(train_data)

## 'data.frame':    213451 obs. of  16 variables:
##  $ id                     : Factor w/ 213451 levels "00023iyk9l","0005ytdols",..:
##  $ date_account_created   : Factor w/ 1634 levels "01-01-10","01-01-11",..: 1476
##  $ timestamp_first_active : num  2.01e+13 2.01e+13 2.01e+13 2.01e+13 2.01e+13
##  $ date_first_booking     : Factor w/ 1977 levels "","01-01-11",..: 1 1 107 504
##  $ gender                 : Factor w/ 4 levels "-unknown-","FEMALE",..: 1 3 2 2
##  $ age                    : int  NA 38 56 42 41 NA 46 47 50 46 ...
##  $ signup_method          : Factor w/ 3 levels "basic","facebook",..: 2 2 1 2 1 1 1 1
##  $ signup_flow            : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ language               : Factor w/ 25 levels "ca","cs","da",..: 6 6 6 6 6 6 6 6 6 6
##  $ affiliate_channel      : Factor w/ 8 levels "api","content",..: 3 8 3 3 3 4 4 3 4 4
##  $ affiliate_provider     : Factor w/ 18 levels "baidu","bing",..: 5 9 5 5 5 13 3 5 3
##  $ first_affiliate_tracked: Factor w/ 8 levels "","linked","local ops",..: 8 8 8 8 8 5
##  $ signup_app             : Factor w/ 4 levels "Android","iOS",..: 4 4 4 4 4 4 4 4 4
##  $ first_device_type      : Factor w/ 9 levels "Android Phone",..: 6 6 9 6 6 6 6 6 6
##  $ first_browser          : Factor w/ 52 levels "-unknown-","Android
##  $ country_destination    : Factor w/ 12 levels "AU","CA","DE",..: 8 8 12 10 12 12

summary(train_data)

##          id         date_account_created timestamp_first_active
##  00023iyk9l:     1   13-05-14:   674      Min.   :2.009e+13
##  0005ytdols:     1   24-06-14:   670      1st Qu.:2.012e+13
##  000guo2307:     1   25-06-14:   636      Median :2.013e+13
##  000wc9mlv3:     1   20-05-14:   632      Mean   :2.013e+13
##  0012yo8hu2:     1   14-05-14:   622      3rd Qu.:2.014e+13
##  001357912w:     1   03-06-14:   602      Max.   :2.014e+13
##  (Other)   :213445   (Other) :209615

##  date_first_booking       gender            age         signup_method
##          :124543    -unknown-:95688   Min.   :   1.00   basic   :152897
##  22-05-14:   248    FEMALE   :63041   1st Qu.:  28.00   facebook: 60008
##  11-06-14:   231    MALE     :54440   Median :  34.00   google  :   546
##  24-06-14:   226    OTHER    :  282   Mean   :  49.67
##  21-05-14:   225                      3rd Qu.:  43.00
##  10-06-14:   223                      Max.   :2014.00
##  (Other) : 87755                      NA's   :87990

##   signup_flow       language          affiliate_channel
##  Min.   : 0.000   en     :206314   direct       :137727
##  1st Qu.: 0.000   zh     :  1632   sem-brand    : 26045
##  Median : 0.000   fr     :  1172   sem-non-brand: 18844
##  Mean   : 3.267   es     :   915   other        :  8961
##  3rd Qu.: 0.000   ko     :   747   seo          :  8663
##  Max.   :25.000   de     :   732   api          :  8167
##                   (Other):  1939   (Other)      :  5044

##  affiliate_provider first_affiliate_tracked   signup_app
##  direct    :137426   untracked    :109232    Android:  5454
##  google    : 51693   linked       : 46287    iOS    : 19019
##  other     : 12549   omg          : 43982    Moweb  :  6261
##  craigslist:  3471   tracked-other:  6156    Web    :182717
##  bing      :  2328                :  6065
##  facebook  :  2273   product      :  1556
##  (Other)   :  3711   (Other)      :   173

##       first_device_type      first_browser    country_destination
##  Mac Desktop    :89600   Chrome       :63845   NDF    :124543
##  Windows Desktop:72716   Safari       :45169   US     : 62376
##  iPhone         :20759   Firefox      :33655   other  : 10094
##  iPad           :14339   -unknown-    :27266   FR     :  5023
##  Other/Unknown  :10667   IE           :21068   IT     :  2835
##  Android Phone  : 2803   Mobile Safari:19274   GB     :  2324
##  (Other)        : 2567   (Other)      : 3174   (Other):  6256

The preliminary examination of the dataset with R provides us with the following information.

  • The training dataset consists of 213,451 observations and 16 variables.
  • There are 124,543 missing values in date_first_booking; however, R does not identify them as NA because they are stored as empty strings.
  • gender contains 95,688 unknown values and 282 labelled as OTHER.
  • age has 87,990 missing values; the minimum age is 1 and the maximum age is 2014.
  • first_browser contains 27,266 unknown values.
  • country_destination contains 124,543 NDF (no destination found) values.

 

Data Cleaning

The dataset is messy and contains a lot of missing values. I clean it with the following R code:

# Clean Dataset
# check total of NA before data cleaning
sum(is.na(train_data))

## [1] 87990

# change "" to NA in date_first_booking
train_data$date_first_booking[train_data$date_first_booking==""] = NA
sum(is.na(train_data$date_first_booking))

## [1] 124543

# change -unknown- to NA in gender
train_data$gender[train_data$gender=="-unknown-"] = NA
summary(train_data$gender)

## -unknown-    FEMALE      MALE     OTHER      NA's
##         0     63041     54440       282     95688

# change -unknown- to NA in first_browser
train_data$first_browser[train_data$first_browser=="-unknown-"] = NA
sum(is.na(train_data$first_browser))

## [1] 27266

# change NDF in country_destination to NA
train_data$country_destination[train_data$country_destination=="NDF"] = NA
summary(train_data$country_destination)

##      AU      CA      DE      ES      FR      GB      IT     NDF      NL   other
##     539    1428    1061    2249    5023    2324    2835       0     762   10094
##      PT      US    NA's
##     217   62376  124543

[Histogram: age values including outliers]

 

Looking at the age histogram, there is a significant number of incorrect values in the dataset. I make a reasonable assumption, establishing the valid range of ages as (18, 100), and assume that all values above 1900 are likely to be birth years.
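
Acting on that assumption, one could convert the apparent birth years to ages instead of discarding them. The following is only an optional sketch, kept in a separate column; the cleaning actually performed below simply removes these values, and the summaries that follow reflect that.

# Optional sketch: treat age values above 1900 as birth years and convert
# them to ages using the year taken from timestamp_first_active
# (format YYYYMMDDhhmmss). Stored in age_fixed so the cleaning below is
# unaffected.
first_active_year <- floor(train_data$timestamp_first_active / 1e10)
birth_year_rows   <- !is.na(train_data$age) & train_data$age > 1900
age_fixed <- train_data$age
age_fixed[birth_year_rows] <- first_active_year[birth_year_rows] - train_data$age[birth_year_rows]
summary(age_fixed[birth_year_rows])   # inspect the converted ages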

# Remove suspicious/incorrect ages
train_data$age[train_data$age<18]=NA
train_data$age[train_data$age>100]=NA
summary(train_data$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   18.00   28.00   34.00   36.58   42.00  100.00   90493

# check total of NA after data cleaning
sum(is.na(train_data))

## [1] 462533

 

Analysis of Variance

Analysis of Variance (ANOVA), developed by Ronald Fisher, is a collection of statistical models used to analyze the differences between the group means and their associated procedures.[i]

First, I conduct a one-way ANOVA to test whether the means of the numeric variables differ from one another.

# One-way ANOVA test
names(train_data)

##  [1] "id"                      "date_account_created"
##  [3] "timestamp_first_active"  "date_first_booking"
##  [5] "gender"                  "age"
##  [7] "signup_method"           "signup_flow"
##  [9] "language"                "affiliate_channel"
## [11] "affiliate_provider"      "first_affiliate_tracked"
## [13] "signup_app"              "first_device_type"
## [15] "first_browser"           "country_destination"

stack_train_data <- stack(train_data)

## Warning in stack.data.frame(train_data): non-vector columns will be ignored

names(stack_train_data)

## [1] "values" "ind"

oneway.test(values~ind,var.equal=TRUE, data=stack_train_data)

##
##  One-way analysis of means
##
## data:  values and ind
## F = 7.9603e+11, num df = 2, denom df = 549860, p-value < 2.2e-16

one_way_aov <- aov(values~ind, data=stack_train_data)
summary(one_way_aov)

##                 Df    Sum Sq   Mean Sq  F value Pr(>F)
## ind              2 5.292e+31 2.646e+31 7.96e+11 <2e-16 ***
## Residuals   549857 1.828e+25 3.324e+19
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 90493 observations deleted due to missingness

The statistically significant result, F(2, 549857) = 7.96e+11 with a P-value < 0.05, tells us that there are meaningful differences between the variable means. The output also tells us that 90,493 observations were removed due to missing values. However, it does not tell us which variables differ from each other significantly. To find out, I conduct a Tukey HSD (honestly significant difference) post hoc test. The Tukey test reveals that timestamp_first_active differs significantly from both age and signup_flow, while the difference between signup_flow and age is not significant (adjusted p-value of 1).

# Post-hoc test
TukeyHSD(one_way_aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## Fit: aov(formula = values ~ ind, data = stack_train_data)
##
## $ind
##                                            diff           lwr          upr p adj
## signup_flow-age                    2.927080e+02 -4.837756e+07 4.837814e+07     1
## timestamp_first_active-age         2.013083e+13  2.013079e+13 2.013088e+13     0
## timestamp_first_active-signup_flow 2.013083e+13  2.013079e+13 2.013088e+13     0

I then continue to conduct a two-way ANOVA.

# Two-way ANOVA test
two_way_aov <- aov(as.numeric(country_destination) ~ date_first_booking
                   + gender + age + signup_method + language + signup_app
                   + first_device_type + first_browser, data=train_data)

summary(two_way_aov)

##                        Df Sum Sq Mean Sq F value   Pr(>F)
## date_first_booking   1884  19527   10.36   1.345  < 2e-16 ***
## gender                  2     56   28.02   3.636 0.026354 *
## age                     1    211  210.62  27.335 1.72e-07 ***
## signup_method           2     26   12.97   1.683 0.185859
## language               22    487   22.13   2.873 7.49e-06 ***
## signup_app              3    277   92.47  12.001 7.52e-08 ***
## first_device_type       8    228   28.54   3.705 0.000246 ***
## first_browser          33    232    7.03   0.912 0.612104
## Residuals           49574 381959    7.70
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 161921 observations deleted due to missingness

We can see that the P-value for gender is slightly below 0.05, while the P-values for date_first_booking, age, language, signup_app and first_device_type are extremely low. The results suggest that these independent variables are statistically significant, whereas signup_method and first_browser are not.

 

Chi-Square Test

The chi-square test is also known as the “goodness of fit” test. The chi-square distribution is the probability distribution of a sum of squared values sampled from a standard normal distribution [ii].

# Goodness of fit
# frequency for country_destination
country_des <- table(train_data$country_destination)
# remove column NDF
country_des[-8]

##
##    AU    CA    DE    ES    FR    GB    IT    NL other    PT    US
##   539  1428  1061  2249  5023  2324  2835   762 10094   217 62376

# Assume the country_destination statistics as below and determine
# whether the dataset supports it as 0.05 significance level

# AU      CA     DE      ES      FR  GB    IT      NL     Other    PT      US
# 0.7%  2.2%  1.1%  2.7%  6%  3%  3.5%  0.8%  10.1%   0.2%  69.7%

# probability of each country_destination based on the above statistics
country_des_prop <- c(0.007,0.022,0.011,0.027,0.06,0.03,0.035,0.008,0.101,0.002,0.697)

# Chi-Square Test
chisq.test(country_des[-8], p=country_des_prop)

##  Chi-squared test for given probabilities
##
## data:  country_des[-8]
## X-squared = 410.47, df = 10, p-value < 2.2e-16

The P-value is less than 0.05, suggesting that the assumed distribution of destinations does not fit the observed data.
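
To see where the test statistic comes from, it can also be computed by hand as the sum of (observed - expected)^2 / expected; this is just an arithmetic check, not part of the original analysis:

# manual computation of the chi-square statistic
observed <- country_des[-8]
expected <- sum(observed) * country_des_prop
sum((observed - expected)^2 / expected)   # should match the X-squared value above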

 

Normality Tests

In statistics, normality tests are used to determine if a dataset is well-modeled by a normal distribution and to compute how likely it is for a random variable underlying the dataset to be normally distributed.[iii]

Let’s examine whether the sample is from a population with a normal distribution.

# Normality Test
par(mfrow=c(2,2))
hist(train_data$age, main="Age")
plot(train_data$age, main="Age")
boxplot(train_data$age, main="Age")

# Quantile-quantile plot
qqnorm(train_data$age)
qqline(train_data$age, lty=2, col="red")

[Plots: age histogram, index plot, boxplot and normal Q-Q plot]

From the histogram, the distribution is somewhat skewed to the right (the mean age exceeds the median). The normal Q-Q plot shows a slight S-shape, but there is no compelling evidence of non-normality, so the data could reasonably be treated parametrically.

Let’s try the Shapiro-Wilk normality test.

# Shapiro-Wilk test
shapiro.test(train_data$age)

## Error in shapiro.test(train_data$age) :
##   sample size must be between 3 and 5000

The Shapiro-Wilk test only accepts samples of between 3 and 5,000 values, so we cannot run it on the full age variable, which has far more than 5,000 observations.
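
One common workaround, which I did not use in the analysis above, is to run the test on a random subsample of at most 5,000 non-missing values:

# Shapiro-Wilk test on a random subsample of 5,000 ages
set.seed(1234)
shapiro.test(sample(na.omit(train_data$age), 5000))

Note that with a sample this large the test is very sensitive, so even small departures from normality will tend to produce a tiny p-value.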

 

Conclusion

One-way ANOVA is a technique used to compare the means of three or more samples, and it can be used only for numerical data [iv]. Two-way ANOVA studies the relationship between a numerical dependent variable and categorical independent variables. That said, there are better algorithms and methods than ANOVA for modelling where people are likely to travel on their first Airbnb trip.

 

References

[i] https://en.wikipedia.org/wiki/Analysis_of_variance

[ii] McClave and Sincich, Statistics, Eleventh Edition, Prentice Hall

[iii] https://en.wikipedia.org/wiki/Normality_test

[iv] https://en.wikipedia.org/wiki/One-way_analysis_of_variance

 

Anscombe’s Quartet

Anscombe’s quartet was created by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties [i]. It consists of four datasets that have nearly identical summary statistics, yet appear very different when graphed.

Let’s start the statistical analysis by constructing four new datasets based on Anscombe’s quartet.

[Table: the four new datasets]
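
The construction of new_anscombe is not shown in the post; one possible way to obtain a quartet with the statistics reported below (this is only one construction, not necessarily the one used) is simply to scale R’s built-in anscombe data frame by two:

# One possible construction: scaling every x and y in the built-in quartet
# by 2 doubles the means and the intercept, quadruples the variances, and
# leaves the slope and the correlation unchanged, giving mean(x) = 18,
# var(x) = 44, mean(y) = 15 and the fitted line y = 6.00 + 0.50x.
new_anscombe <- anscombe * 2
head(new_anscombe)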

We are going to explore these datasets by using R and ggplot2.

 

Statistical Properties

First, let’s look at the statistical properties of the datasets.

# Mean of x
sapply(1:4, function(x) mean(new_anscombe[, x]))

## [1] 18 18 18 18

# Variance of x
sapply(1:4, function(x) var(new_anscombe[, x]))

## [1] 44 44 44 44

# Mean of y
round(sapply(5:8, function(x) mean(new_anscombe[, x])),2)

## [1] 15 15 15 15

# Variance of y
round(sapply(5:8, function(x) var(new_anscombe[, x])),2)

## [1] 16.51 16.51 16.49 16.49

# Pearson’s Correlation
round(sapply(1:4, function(x) cor(new_anscombe[, x], new_anscombe[, x+4])),3)

## [1] 0.816 0.816 0.816 0.817

# Linear regression models
# (anscombe_data, which stacks the four sets with a Set label, is constructed
# in the Data Visualizations section below)
model1 <- lm(y ~ x, subset(anscombe_data, Set == "Anscombe Set 1"))
model2 <- lm(y ~ x, subset(anscombe_data, Set == "Anscombe Set 2"))
model3 <- lm(y ~ x, subset(anscombe_data, Set == "Anscombe Set 3"))
model4 <- lm(y ~ x, subset(anscombe_data, Set == "Anscombe Set 4"))

# Coefficients
coef(model1)

## (Intercept)           x
##   6.0001818   0.5000909

coef(model2)

## (Intercept)           x
##    6.001818    0.500000

coef(model3)

## (Intercept)           x
##   6.0049091   0.4997273

coef(model4)

## (Intercept)           x
##   6.0034545   0.4999091

We can summarize the calculated statistics for all four datasets as follows:

The sample size is 11 data points for the X and Y variables.

The sample mean is 18 for the X data points and 15 for the Y data points.

The sample variance is 44 for the X data points and 16.49-16.51 for the Y data points.

The sample Pearson’s Correlation Coefficient is 0.816-0.817 which means the X and Y variables are positively linearly associated.

A linear regression (line of best fit) follows the equation y = 6.00 + 0.50x

Interestingly, all of the summary statistics of the four datasets are essentially the same. Without inspecting each dataset further, one could easily conclude that the four datasets are very much alike.

 

Data Visualizations

Anscombe argues that “graphs are essential to good statistical analysis. Most kinds of statistical calculations rest on assumptions about the behavior of the data. Those assumptions may be false, and then the calculations may be misleading. We ought always to try to check whether the assumptions are reasonably correct; and if they are wrong we ought to be able to perceive in what ways they are wrong. Graphs are very valuable for these purposes.” [ii]

Now, let’s construct scatterplots to visualize each dataset and see whether we can deduce any more information by comparing and contrasting the four graphs.

# Plotting the datasets
library(ggplot2)

anscombe1 <- data.frame(x = new_anscombe[["x1"]], y = new_anscombe[["y1"]], Set = "Anscombe Set 1")
anscombe2 <- data.frame(x = new_anscombe[["x2"]], y = new_anscombe[["y2"]], Set = "Anscombe Set 2")
anscombe3 <- data.frame(x = new_anscombe[["x3"]], y = new_anscombe[["y3"]], Set = "Anscombe Set 3")
anscombe4 <- data.frame(x = new_anscombe[["x4"]], y = new_anscombe[["y4"]], Set = "Anscombe Set 4")

anscombe_data <- rbind(anscombe1, anscombe2, anscombe3, anscombe4)

g1 <- ggplot(anscombe_data, aes(x = x, y = y)) +
  geom_point(color = "darkorange", size = 3) +
  facet_wrap(~Set, ncol = 2) +
  ggtitle("Anscombe's Quartet") +
  theme(plot.title = element_text(colour="blue", face="bold", size=12))

[Plot: Anscombe's Quartet scatterplots]

# plot the linear model line
g2 <- g1 + geom_smooth(formula = y ~ x, method = "lm", se = FALSE, data = anscombe_data)

[Plot: Anscombe's Quartet with fitted regression lines]

At first glance at the four graphs, we can quickly see that these datasets look completely different and each tells its own story.

Anscombe Set 1 looks somewhat linear with a positive slope and some variance.

Anscombe Set 2 looks like a curve, starting with a positive slope and ending with a negative one; all the data points follow a neat curved pattern.

Anscombe Set 3 looks completely linear with a positive slope. However we can see that there is an outlier.

Anscombe Set 4 looks completely vertical as x remains constant and there is also an outlier.

From this we can clearly see how summary statistics can be misleading. Without looking at the graphs, knowing only that the Pearson correlation coefficient is 0.816, fairly close to 1, we might have assumed that the data points hug the fitted line. That is clearly not always the case.

The variability in Anscombe Set 2 comes from the fact that the data follow a curved pattern rather than a linear relationship. Anscombe Set 3 would fit a perfect line, but one outlier pulls the fit away from it. Anscombe Set 4 again shows the effect of an outlier on a sample: once graphed, it is evident that a linear fit between x and y does not make sense. The correlations and fitted lines for both Anscombe Set 3 and 4 are heavily influenced by their respective outliers.

 

 Conclusion

Anscombe’s quartet has shown us the importance of visualizing data and the dangers of relying on summary statistics alone. Truly effective data analysis should consist of both numerical statistics and clear visualizations.

 

References

[i] https://en.wikipedia.org/wiki/Anscombe%27s_quartet

[ii] Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician 27 (1): 17–21.