Airbnb’s Analysis of Variance

Abstract

Airbnb ran a competition on Kaggle in which participants predict where a new user will book their first travel destination. The purpose of this report is to analyze the dataset provided by Airbnb to see whether an analysis of variance (ANOVA) model can anticipate where people are likely to take their first trip on Airbnb. I first did a preliminary examination of the dataset, then cleaned it by recoding placeholder and invalid values as missing. I performed a one-way and a two-way ANOVA. Finally, I ran a goodness-of-fit test and checked whether the data follow a normal distribution.

 

Data Understanding and Preparation

Data Description

I use the data file named train_users_2.csv for this report. It is the training set of users provided by Airbnb and can be downloaded from https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings

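# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0)
train_data <- read.csv("train_users_2.csv", header = TRUE, stringsAsFactors = TRUE)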
str(train_data)

## 'data.frame':    213451 obs. of  16 variables:
##  $ id                     : Factor w/ 213451 levels "00023iyk9l","0005ytdols",..:
##  $ date_account_created   : Factor w/ 1634 levels "01-01-10","01-01-11",..: 1476
##  $ timestamp_first_active : num  2.01e+13 2.01e+13 2.01e+13 2.01e+13 2.01e+13
##  $ date_first_booking     : Factor w/ 1977 levels "","01-01-11",..: 1 1 107 504
##  $ gender                 : Factor w/ 4 levels "-unknown-","FEMALE",..: 1 3 2 2
##  $ age                    : int  NA 38 56 42 41 NA 46 47 50 46 ...
##  $ signup_method          : Factor w/ 3 levels "basic","facebook",..: 2 2 1 2 1 1 1 1
##  $ signup_flow            : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ language               : Factor w/ 25 levels "ca","cs","da",..: 6 6 6 6 6 6 6 6 6 6
##  $ affiliate_channel      : Factor w/ 8 levels "api","content",..: 3 8 3 3 3 4 4 3 4 4
##  $ affiliate_provider     : Factor w/ 18 levels "baidu","bing",..: 5 9 5 5 5 13 3 5 3
##  $ first_affiliate_tracked: Factor w/ 8 levels "","linked","local ops",..: 8 8 8 8 8 5
##  $ signup_app             : Factor w/ 4 levels "Android","iOS",..: 4 4 4 4 4 4 4 4 4
##  $ first_device_type      : Factor w/ 9 levels "Android Phone",..: 6 6 9 6 6 6 6 6 6
##  $ first_browser          : Factor w/ 52 levels "-unknown-","Android
##  $ country_destination    : Factor w/ 12 levels "AU","CA","DE",..: 8 8 12 10 12 12

summary(train_data)

##           id         date_account_created timestamp_first_active
##  00023iyk9l:        1   13-05-14:   674      Min.   :2.009e+13
##  0005ytdols:        1   24-06-14:   670      1st Qu.:2.012e+13
##  000guo2307:      1   25-06-14:   636      Median :2.013e+13
##  000wc9mlv3:     1   20-05-14:   632      Mean   :2.013e+13
##  0012yo8hu2:      1   14-05-14:   622      3rd Qu.:2.014e+13
##  001357912w:     1   03-06-14:   602      Max.   :2.014e+13
##  (Other)   :213445   (Other) :209615

##  date_first_booking       gender                     age             signup_method
##             :124543    -unknown-:95688   Min.     :   1.00     basic   :152897
##  22-05-14:   248    FEMALE   :63041   1st Qu.  :  28.00   facebook: 60008
##  11-06-14:   231    MALE        :54440   Median:  34.00    google  :   546
##  24-06-14:   226    OTHER      :282        Mean   :  49.67
##  21-05-14:   225                                      3rd Qu.:  43.00
##  10-06-14:   223                                      Max.   :2014.00
##  (Other) : 87755                                     NA’s   :87990

##   signup_flow        language           affiliate_channel
##  Min.    : 0.000      en     :206314     direct       :137727
##  1st Qu.: 0.000      zh     :  1632       sem-brand    : 26045
##  Median : 0.000    fr     :  1172        sem-non-brand: 18844
##  Mean    : 3.267    es     :   915         other        :  8961
##  3rd Qu.: 0.000     ko     :   747        seo          :  8663
##  Max.    :25.000    de     :   732        api          :  8167
##                               (Other):  1939    (Other)      :  5044

##   affiliate_provider  first_affiliate_tracked   signup_app
##  direct     :137426    untracked:109232           Android:  5454
##  google    : 51693     linked       : 46287            iOS         : 19019
##  other      : 12549     omg          : 43982            Moweb  :  6261
##  craigslist:  3471      tracked-other:  6156      Web       :182717
##  bing        :  2328                        :  6065
##  facebook:  2273    product      :  1556
##  (Other)   :  3711    (Other)      :   173

##        first_device_type           first_browser            country_destination
##  Mac Desktop         :89600    Chrome      :63845    NDF    :124543
##  Windows Desktop:72716   Safari          :45169    US     : 62376
##  iPhone                    :20759   Firefox        :33655    other  : 10094
##  iPad                        :14339    -unknown- :27266    FR     :  5023
##  Other/Unknown  :10667    IE                  :21068    IT     :  2835
##  Android Phone    : 2803     Mobile Safari:19274 GB     :  2324
##  (Other)                  : 2567     (Other)      : 3174        (Other):  6256

The preliminary examination of the dataset with R tells us the following:

  • The training dataset consists of 213,451 observations and 16 variables.
  • date_first_booking has 124,543 missing values, stored as empty strings, so R does not identify them as NA.
  • gender has 95,688 values of -unknown- and 282 labeled OTHER.
  • age has 87,990 missing values; the minimum age is 1 and the maximum is 2014.
  • first_browser has 27,266 values of -unknown-.
  • country_destination has 124,543 values of NDF (no destination found).

 

Data Cleaning

The dataset is messy and contains many missing values. I clean it with the following R code:

# Clean Dataset
# check total of NA before data cleaning
sum(is.na(train_data))

## [1] 87990

# change "" to NA in date_first_booking
train_data$date_first_booking[train_data$date_first_booking == ""] <- NA
sum(is.na(train_data$date_first_booking))

## [1] 124543

# change -unknown- to NA in gender
train_data$gender[train_data$gender == "-unknown-"] <- NA
summary(train_data$gender)

## -unknown-    FEMALE      MALE     OTHER      NA’s
##         0                 63041        54440         282       95688

# change -unknown- to NA in first_browser
train_data$first_browser[train_data$first_browser == "-unknown-"] <- NA
sum(is.na(train_data$first_browser))

## [1] 27266

# change NDF in country_destination to NA
train_data$country_destination[train_data$country_destination == "NDF"] <- NA
summary(train_data$country_destination)

##     AU     CA      DE      ES       FR       GB       IT    NDF     NL  other
##    539   1428   1061   2249   5023   2324   2835      0    762  10094
##     PT      US       NA’s
##    217  62376 124543
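The same recoding can also be done while reading the file. The following is a minimal sketch, assuming a small inline CSV (hypothetical rows, not the Kaggle file) and assuming the listed placeholder strings never carry meaning in any column, since `na.strings` applies to every column:

```r
# Hypothetical three-row sample that mimics the placeholder values in the real file
csv_text <- "id,gender,date_first_booking,first_browser,country_destination
a1,-unknown-,,Chrome,NDF
b2,FEMALE,22-05-14,-unknown-,US
c3,MALE,11-06-14,Safari,FR"

# na.strings turns every listed string into NA while reading
sample_data <- read.csv(
  text = csv_text,
  header = TRUE,
  stringsAsFactors = TRUE,
  na.strings = c("", "-unknown-", "NDF")
)

# Count missing values per column
na_per_column <- colSums(is.na(sample_data))
print(na_per_column)
```

One side effect of this approach is that the placeholders never become factor levels, so summaries do not show an empty -unknown- or NDF level with a count of 0.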

[Figure: Age_outliers (histogram of age)]

 

Looking at the age histogram, there is a significant number of incorrect values in the dataset. I make a reasonable assumption and take the valid range of ages to be 18 to 100 inclusive; values above 1900 are likely to be birth years rather than ages.

# Remove suspicious/incorrect ages
train_data$age[train_data$age<18]=NA
train_data$age[train_data$age>100]=NA
summary(train_data$age)

##    Min. 1st Qu.  Median  Mean  3rd Qu.  Max.      NA’s
##   18.00   28.00   34.00      36.58   42.00     100.00   90493

# check total of NA after data cleaning
sum(is.na(train_data))

## [1] 462533
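The birth-year assumption suggests an alternative to discarding those values: convert them to ages. This is only a sketch, assuming a hypothetical vector of raw ages and a hypothetical account-creation year for each user (neither is taken from the dataset):

```r
# Hypothetical raw ages, including values that look like birth years
raw_age <- c(25, 1985, 2014, 5, 105, NA, 42)
# Hypothetical year in which each account was created
account_year <- c(2014, 2014, 2014, 2013, 2012, 2014, 2013)

# Treat values above 1900 as birth years and convert them to ages
looks_like_year <- !is.na(raw_age) & raw_age > 1900
age <- raw_age
age[looks_like_year] <- account_year[looks_like_year] - raw_age[looks_like_year]

# Apply the same valid range as in the cleaning step: keep 18 to 100, set the rest to NA
age[which(age < 18 | age > 100)] <- NA
print(age)
```

Here 1985 becomes a usable age of 29, while 2014 converts to 0 and is still set to NA by the range check.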

 

Analysis of Variance

Analysis of Variance (ANOVA), developed by Ronald Fisher, is a collection of statistical models, and their associated procedures, used to analyze the differences between group means.[i]
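To make the idea concrete, the F statistic of a one-way ANOVA is the between-group mean square divided by the within-group mean square, F = (SS_between / (k - 1)) / (SS_within / (N - k)) for k groups and N observations. The sketch below computes it by hand on hypothetical toy data (not the Airbnb dataset) and compares the result with `aov()`:

```r
# Hypothetical data: one numeric response measured in three groups
values <- c(4, 5, 6,  7, 8, 9,  1, 2, 3)
group  <- factor(rep(c("g1", "g2", "g3"), each = 3))

grand_mean  <- mean(values)
group_means <- tapply(values, group, mean)
group_sizes <- tapply(values, group, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((values - group_means[as.character(group)])^2)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between <- nlevels(group) - 1
df_within  <- length(values) - nlevels(group)

f_by_hand <- (ss_between / df_between) / (ss_within / df_within)

# The same statistic taken from the aov() summary table
f_from_aov <- summary(aov(values ~ group))[[1]][["F value"]][1]
print(c(by_hand = f_by_hand, aov = f_from_aov))
```

A large F means the group means are spread out much more than the scatter within each group would explain.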

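First, I conduct a one-way ANOVA to test whether the mean of at least one variable differs from the others. Because stack() keeps only the numeric columns, the test compares age, signup_flow and timestamp_first_active.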

# One-way ANOVA test
names(train_data)

##  [1] "id"                      "date_account_created"
##  [3] "timestamp_first_active"  "date_first_booking"
##  [5] "gender"                  "age"
##  [7] "signup_method"           "signup_flow"
##  [9] "language"                "affiliate_channel"
## [11] "affiliate_provider"      "first_affiliate_tracked"
## [13] "signup_app"              "first_device_type"
## [15] "first_browser"           "country_destination"

stack_train_data <- stack(train_data)

## Warning in stack.data.frame(train_data): non-vector columns will be ignored

names(stack_train_data)

## [1] "values" "ind"

oneway.test(values~ind,var.equal=TRUE, data=stack_train_data)

##
##  One-way analysis of means
##
## data:  values and ind
## F = 7.9603e+11, num df = 2, denom df = 549860, p-value < 2.2e-16

one_way_aov <- aov(values~ind, data=stack_train_data)
summary(one_way_aov)

##                 Df    Sum Sq   Mean Sq  F value Pr(>F)
## ind              2 5.292e+31 2.646e+31 7.96e+11 <2e-16 ***
## Residuals   549857 1.828e+25 3.324e+19
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 90493 observations deleted due to missingness

The statistically significant result, F(2, 549857) = 7.96e+11 with a P-value < 0.05, tells us that there are meaningful differences between the variable means. The output also tells us that 90,493 observations were removed due to missing values. However, it does not tell us which variables differ significantly from each other. To find out, I conduct a Tukey HSD (honestly significant difference) post hoc test. It reveals significant differences between timestamp_first_active and each of the other two variables (adjusted p = 0), but not between signup_flow and age (adjusted p = 1).

# Post-hoc test
TukeyHSD(one_way_aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## Fit: aov(formula = values ~ ind, data = stack_train_data)
##
## $ind
##                                            diff           lwr          upr
## signup_flow-age                    2.927080e+02 -4.837756e+07 4.837814e+07
## timestamp_first_active-age         2.013083e+13  2.013079e+13 2.013088e+13
## timestamp_first_active-signup_flow 2.013083e+13  2.013079e+13 2.013088e+13
##                                    p adj
## signup_flow-age                        1
## timestamp_first_active-age             0
## timestamp_first_active-signup_flow     0

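I then conduct a two-way ANOVA. Strictly speaking, the model has eight predictors, so it is a multi-factor ANOVA; country_destination is converted to its numeric factor codes to serve as the response.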

# Two-way ANOVA test
two_way_aov <- aov(as.numeric(country_destination) ~ date_first_booking
                   + gender + age + signup_method + language + signup_app
                   + first_device_type + first_browser, data = train_data)

summary(two_way_aov)

##                       Df Sum Sq Mean Sq F value   Pr(>F)
## date_first_booking  1884  19527   10.36   1.345  < 2e-16 ***
## gender                 2     56   28.02   3.636 0.026354 *
## age                    1    211  210.62  27.335 1.72e-07 ***
## signup_method          2     26   12.97   1.683 0.185859
## language              22    487   22.13   2.873 7.49e-06 ***
## signup_app             3    277   92.47  12.001 7.52e-08 ***
## first_device_type      8    228   28.54   3.705 0.000246 ***
## first_browser         33    232    7.03   0.912 0.612104
## Residuals          49574 381959    7.70
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 161921 observations deleted due to missingness

The P-value for gender (0.026) is below 0.05, while the P-values for date_first_booking, age, language, signup_app and first_device_type are extremely low. The results suggest that these independent variables are statistically significant, whereas signup_method and first_browser are not.
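For comparison, a two-way ANOVA in the strict sense has exactly two factors and, optionally, their interaction. The sketch below uses simulated data; the factor names are borrowed from the dataset, but the response and the effect sizes are invented for illustration:

```r
set.seed(1)
# Hypothetical balanced design: 2 x 3 factor levels, 10 observations per cell
design <- expand.grid(
  rep_id = 1:10,
  gender = c("FEMALE", "MALE"),
  signup_app = c("Android", "iOS", "Web")
)
design$gender <- factor(design$gender)
design$signup_app <- factor(design$signup_app)

# Simulated numeric response with a signup_app effect and no gender effect
app_shift <- c(Android = 0, iOS = 2, Web = 4)
design$response <- 30 + app_shift[as.character(design$signup_app)] +
  rnorm(nrow(design))

# gender * signup_app expands to both main effects plus their interaction
two_factor_fit <- aov(response ~ gender * signup_app, data = design)
anova_table <- summary(two_factor_fit)[[1]]
print(anova_table)
```

The resulting table has one row per main effect, one row for the gender:signup_app interaction, and a residual row, which makes it easier to read than a model with eight predictors.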

 

Chi-Square Test

The chi-square test used here is the “goodness of fit” test, which checks whether observed category counts match a hypothesized distribution. The chi-square distribution is the probability distribution of a sum of squared values sampled from a standard normal distribution [ii].

# Goodness of fit
# frequency for country_destination
country_des <- table(train_data$country_destination)
# remove column NDF
country_des[-8]

##
##    AU    CA     DE      ES      FR      GB      IT     NL   other  PT    US
##   539  1428  1061  2249  5023  2324  2835  762  10094  217  62376

# Assume the country_destination proportions below and determine
# whether the dataset supports them at the 0.05 significance level

# AU    CA    DE    ES    FR  GB  IT    NL    other  PT    US
# 0.7%  2.2%  1.1%  2.7%  6%  3%  3.5%  0.8%  10.1%  0.2%  69.7%

# probability for the country_destination based on the above proportions
country_des_prop <- c(0.007,0.022,0.011,0.027,0.06,0.03,0.035,0.008,0.101,0.002,0.697)

# Chi-Square Test
chisq.test(country_des[-8], p=country_des_prop)

##  Chi-squared test for given probabilities
##
## data:  country_des[-8]
## X-squared = 410.47, df = 10, p-value < 2.2e-16

The P-value is less than 0.05, which suggests that the assumed distribution of destinations does not match the data.
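The statistic reported by chisq.test() can be reproduced by hand, which shows what the test measures. The sketch below copies the observed counts and the assumed proportions from the code above; the variable names are my own:

```r
# Observed counts copied from the frequency table above (NDF excluded)
observed <- c(AU = 539, CA = 1428, DE = 1061, ES = 2249, FR = 5023, GB = 2324,
              IT = 2835, NL = 762, other = 10094, PT = 217, US = 62376)
# Hypothesized proportions, as assumed in the test above
hypothesized <- c(0.007, 0.022, 0.011, 0.027, 0.06, 0.03,
                  0.035, 0.008, 0.101, 0.002, 0.697)

# Expected count per category under the hypothesized proportions
expected <- sum(observed) * hypothesized

# Pearson statistic: sum over categories of (O - E)^2 / E
x_squared <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

print(round(x_squared, 2))
```

Inspecting the individual terms of `(observed - expected)^2 / expected` shows which destinations contribute most to the mismatch; here CA and other dominate.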

 

Normality Tests

In statistics, normality tests are used to determine if a dataset is well-modeled by a normal distribution and to compute how likely it is for a random variable underlying the dataset to be normally distributed.[iii]

Let’s examine whether the sample is from a population with a normal distribution.

# Normality Test
par(mfrow=c(2,2))
hist(train_data$age, main = "Age")
plot(train_data$age, main = "Age")
boxplot(train_data$age, main = "Age")

# Quantile-quantile plot
qqnorm(train_data$age)
qqline(train_data$age, lty = 2, col = "red")

[Figure: Age_normality_plots (histogram, scatter plot, box plot and normal Q-Q plot of age)]

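From the histogram, the distribution is somewhat skewed to the right: most ages sit on the left, with a long tail toward older ages (the mean of 36.58 exceeds the median of 34). The normal Q-Q plot shows a slight S-shape, but there is no compelling evidence of non-normality, so the data can be treated parametrically.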

Let’s try the Shapiro-Wilk normality test.

# Shapiro-Wilk test
shapiro.test(train_data$age)

## Error in shapiro.test(train_data$age) :
##   sample size must be between 3 and 5000

The Shapiro-Wilk test in R only accepts samples of between 3 and 5000 observations. We cannot run it on the full age column, as our dataset has more than 5000 observations.
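One common workaround is to run the test on a random subsample of at most 5000 values. The sketch below uses simulated right-skewed data as a stand-in for the cleaned age column; the gamma parameters are an arbitrary assumption, not an estimate from the dataset:

```r
set.seed(42)
# Hypothetical right-skewed stand-in for the cleaned age column
simulated_age <- 18 + rgamma(20000, shape = 3, scale = 6)
simulated_age <- simulated_age[simulated_age <= 100]

# shapiro.test() accepts at most 5000 values, so test a random subsample
age_subsample <- sample(simulated_age, size = 5000)
shapiro_result <- shapiro.test(age_subsample)
print(shapiro_result)
```

With the real data, the NA values would need to be dropped before sampling, for example with `sample(na.omit(train_data$age), 5000)`. Note that with samples this large the test tends to flag even small departures from normality, so the result is best read alongside the Q-Q plot.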

 

Conclusion

One-way ANOVA is a technique used to compare the means of three or more samples. It can be used only for numerical data [iv]. Two-way ANOVA is a technique that studies the relationship between a numerical dependent variable and two categorical independent variables. Because country_destination is categorical rather than numerical, ANOVA is a poor fit as the statistical model describing where people are likely to travel on their first Airbnb trip; there are better algorithms and methods for this task.

 

References

[i] https://en.wikipedia.org/wiki/Analysis_of_variance

[ii] McClave and Sincich, Statistics, Eleventh Edition, Prentice Hall

[iii] https://en.wikipedia.org/wiki/Normality_test

[iv] https://en.wikipedia.org/wiki/One-way_analysis_of_variance

 
