Anscombe’s quartet was created by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties [i]. It consists of four datasets that have similar statistical properties, yet appear very different when graphed.
Let’s start the statistical analysis by constructing a new version of the four datasets in Anscombe’s quartet.
We are going to explore these datasets by using R and ggplot2.
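The construction of new_anscombe is not shown here. Judging from the statistics reported below (mean of x = 18, variance of x = 44, mean of y = 15), which are exactly double the means and quadruple the variances of R’s built-in anscombe data frame, one plausible construction, offered purely as an assumption, is a simple rescaling:

```r
# Hypothetical construction of new_anscombe (an assumption: doubling the
# built-in anscombe data doubles the means and quadruples the variances,
# matching every statistic reported below).
data(anscombe)                # built-in data frame with columns x1..x4, y1..y4
new_anscombe <- anscombe * 2
```

Scaling both variables by the same constant leaves the correlations (0.816) and regression slopes (0.50) unchanged, while the intercept doubles from the classic 3.00 to the 6.00 seen below.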
First, let’s look at the statistical properties of the datasets.
# Mean of x
sapply(1:4, function(i) mean(new_anscombe[, i]))
##  18 18 18 18
# Variance of x
sapply(1:4, function(i) var(new_anscombe[, i]))
##  44 44 44 44
# Mean of y
round(sapply(5:8, function(i) mean(new_anscombe[, i])), 2)
##  15 15 15 15
# Variance of y
round(sapply(5:8, function(i) var(new_anscombe[, i])), 2)
##  16.51 16.51 16.49 16.49
# Pearson's correlation between x and y
round(sapply(1:4, function(i) cor(new_anscombe[, i], new_anscombe[, i + 4])), 3)
##  0.816 0.816 0.816 0.817
# Linear regression models (one per dataset)
model1 <- lm(y1 ~ x1, data = new_anscombe)
model2 <- lm(y2 ~ x2, data = new_anscombe)
model3 <- lm(y3 ~ x3, data = new_anscombe)
model4 <- lm(y4 ~ x4, data = new_anscombe)
coef(model1)
## (Intercept)          x1
##   6.0001818   0.5000909
coef(model2)
## (Intercept)          x2
##    6.001818    0.500000
coef(model3)
## (Intercept)          x3
##   6.0049091   0.4997273
coef(model4)
## (Intercept)          x4
##   6.0034545   0.4999091
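The fits agree on more than the coefficients: the R-squared values are also essentially identical across the four sets. A quick check (a self-contained sketch; new_anscombe is reconstructed here as anscombe * 2, which is an assumption consistent with the statistics above):

```r
data(anscombe)
new_anscombe <- anscombe * 2   # assumed construction

# Fit y_i ~ x_i for each of the four datasets.
models <- lapply(1:4, function(i)
  lm(reformulate(paste0("x", i), paste0("y", i)), data = new_anscombe))

# R-squared is ~0.67 for every set, despite the very different shapes.
round(sapply(models, function(m) summary(m)$r.squared), 3)
```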
We can summarize the calculated statistics for all four datasets as follows:
The sample size is 11 data points for the X and Y variables.
The sample mean is 18 for the X data points and 15 for the Y data points.
The sample variance is 44 for the X data points and 16.49-16.51 for the Y data points.
The sample Pearson correlation coefficient is 0.816-0.817, which means the X and Y variables are positively linearly associated.
A linear regression (line of best fit) follows the equation y = 6.00 + 0.50x.
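As a cross-check, the individual calculations above can be collected into one summary table (a base-R sketch; new_anscombe is reconstructed here as anscombe * 2, which is an assumption consistent with the statistics above):

```r
data(anscombe)
new_anscombe <- anscombe * 2   # assumed construction

# One row per dataset, one column per summary statistic.
summary_table <- data.frame(
  set    = 1:4,
  mean_x = sapply(1:4, function(i) mean(new_anscombe[, i])),
  var_x  = sapply(1:4, function(i) var(new_anscombe[, i])),
  mean_y = round(sapply(5:8, function(i) mean(new_anscombe[, i])), 2),
  var_y  = round(sapply(5:8, function(i) var(new_anscombe[, i])), 2),
  cor_xy = round(sapply(1:4, function(i) cor(new_anscombe[, i], new_anscombe[, i + 4])), 3)
)
summary_table
```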
Interestingly, we find that the statistical properties of the four datasets are virtually identical. Without inspecting each dataset further, one could easily conclude that these four datasets are very much the same.
Anscombe argues that “graphs are essential to good statistical analysis. Most kinds of statistical calculations rest on assumptions about the behavior of the data. Those assumptions may be false, and then the calculations may be misleading. We ought always to try to check whether the assumptions are reasonably correct; and if they are wrong we ought to be able to perceive in what ways they are wrong. Graphs are very valuable for these purposes.” [ii]
Now, let’s construct scatterplots to visualize each dataset and see whether we can deduce any more information by comparing and contrasting the four graphs.
# Plotting the datasets
anscombe1 <- data.frame(x = new_anscombe[["x1"]], y = new_anscombe[["y1"]], Set = "Anscombe Set 1")
anscombe2 <- data.frame(x = new_anscombe[["x2"]], y = new_anscombe[["y2"]], Set = "Anscombe Set 2")
anscombe3 <- data.frame(x = new_anscombe[["x3"]], y = new_anscombe[["y3"]], Set = "Anscombe Set 3")
anscombe4 <- data.frame(x = new_anscombe[["x4"]], y = new_anscombe[["y4"]], Set = "Anscombe Set 4")
anscombe_data <- rbind(anscombe1, anscombe2, anscombe3, anscombe4)
g1 <- ggplot(anscombe_data, aes(x = x, y = y)) +
  geom_point(color = "darkorange", size = 3) +
  facet_wrap(~Set, ncol = 2) +
  ggtitle("Anscombe's Quartet") +
  theme(plot.title = element_text(colour = "blue", face = "bold", size = 12))
# Add the fitted linear model line to each panel, then display the plot
g2 <- g1 + geom_smooth(formula = y ~ x, method = "lm", se = FALSE, data = anscombe_data)
g2
At first glance, we can quickly see that these four graphs look completely different, and each tells its own story.
Anscombe Set 1 looks somewhat linear with a positive slope and some variance.
Anscombe Set 2 looks like a curve, with a positive slope at the beginning and a negative slope at the end; all of the data points follow the pattern of a neat curve.
Anscombe Set 3 looks completely linear with a positive slope. However, we can see that there is an outlier.
Anscombe Set 4 looks completely vertical, as x remains constant, and there is also an outlier.
From this we can clearly see how the statistical calculations may be misleading. Without looking at the graphs, and knowing only that the Pearson correlation coefficient is 0.816 (fairly close to 1), we could have assumed that the data points cluster tightly around the regression line. Unfortunately, that is not always the case.
The variability in Anscombe Set 2 arises because the data follow a curve rather than a linear relationship. Anscombe Set 3 would fit a straight line perfectly, but one outlier skews the fit off that line. Anscombe Set 4 again shows the effect of an outlier on a sample: once the data are graphed, it is evident that a linear fit between x and y does not make sense. The means and variances for both Anscombe Set 3 and Set 4 are heavily influenced by their respective outliers.
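The influence of the Set 3 outlier can also be quantified numerically with Cook's distance, a standard influence measure available through R's cooks.distance(). A self-contained sketch (new_anscombe is again reconstructed as anscombe * 2, which is an assumption):

```r
data(anscombe)
new_anscombe <- anscombe * 2                 # assumed construction
model3 <- lm(y3 ~ x3, data = new_anscombe)

# Cook's distance measures how much each observation shifts the fitted
# line; the Set 3 outlier should dominate all the other points.
cd <- cooks.distance(model3)
which.max(cd)
```

The observation flagged by which.max is the point sitting far above the otherwise perfect line in Set 3, confirming numerically what the plot makes obvious.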
Anscombe’s quartet has shown us the importance of visualizing data and the danger of relying on summary statistics alone. Truly effective data analysis should combine numerical statistics with clear visualizations.