Monday, January 6, 2014

Why Plot Your Data as Part of Your Analysis?

Suppose you wanted to compare the following datasets from four independent experiments, where 'X' is the continuous independent variable and 'y' is the continuous dependent variable:

Dataset 1
X y
1 10.00 8.04
2 8.00 6.95
3 13.00 7.58
4 9.00 8.81
5 11.00 8.33
6 14.00 9.96
7 6.00 7.24
8 4.00 4.26
9 12.00 10.84
10 7.00 4.82
11 5.00 5.68
Dataset 2
X y
1 10.00 9.14
2 8.00 8.14
3 13.00 8.74
4 9.00 8.77
5 11.00 9.26
6 14.00 8.10
7 6.00 6.13
8 4.00 3.10
9 12.00 9.13
10 7.00 7.26
11 5.00 4.74
Dataset 3
X y
1 10.00 7.46
2 8.00 6.77
3 13.00 12.74
4 9.00 7.11
5 11.00 7.81
6 14.00 8.84
7 6.00 6.08
8 4.00 5.39
9 12.00 8.15
10 7.00 6.42
11 5.00 5.73
Dataset 4
X y
1 8.00 6.58
2 8.00 5.76
3 8.00 7.71
4 8.00 8.84
5 8.00 8.47
6 8.00 7.04
7 8.00 5.25
8 19.00 12.50
9 8.00 5.56
10 8.00 7.91
11 8.00 6.89

After a quick look at the numbers, the datasets seem relatively similar, so you decide to compare the means and variances of each dataset, and you obtain the following:

Dataset Mean.X Variance.X Mean.y Variance.y
1       9.00   11.00      7.50   4.13
2       9.00   11.00      7.50   4.13
3       9.00   11.00      7.50   4.12
4       9.00   11.00      7.50   4.12

Now the data seem very similar. But what about the effect of the independent variable? To find out, you compare simple regressions of y on X for each of the datasets:

Statistical models
Dataset 1 Dataset 2 Dataset 3 Dataset 4
(Intercept) 3.00* 3.00* 3.00* 3.00*
(1.12) (1.13) (1.12) (1.12)
X 0.50** 0.50** 0.50** 0.50**
(0.12) (0.12) (0.12) (0.12)
R2 0.67 0.67 0.67 0.67
Adj. R2 0.63 0.63 0.63 0.63
Num. obs. 11 11 11 11
***p < 0.001, **p < 0.01, *p < 0.05
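The regression table above can be reproduced from R's built-in `anscombe` data frame. A minimal sketch, assuming only base R, fitting `lm()` separately to each of the four X/y column pairs:

```r
# Fit a simple linear regression y ~ X for each of the
# four X/y pairs in R's built-in `anscombe` data frame
fits <- lapply(1:4, function(i) {
  d <- data.frame(X = anscombe[[paste0("x", i)]],
                  y = anscombe[[paste0("y", i)]])
  lm(y ~ X, data = d)
})

# Intercepts and slopes are nearly identical across datasets
round(t(sapply(fits, coef)), 2)   # each row is approximately 3.00, 0.50

# So are the R-squared values
round(sapply(fits, function(f) summary(f)$r.squared), 2)   # roughly 0.67 each
```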

The means, variances, regression coefficients, and effect sizes are nearly identical! These data must be nearly identical... except they're not: the relationship between X and y is very different in each set. While there is a generally linear relationship between the variables in Dataset 1, the relationship is curvilinear in Dataset 2. Datasets 3 and 4 each contain a single extreme outlier, though in different forms.

Anscombe's Quartet

Francis Anscombe created these data for a 1973 paper to demonstrate the importance of data visualization. Though the datasets have nearly identical statistical properties, they in fact represent four very different relationships between X and y. Even with only 11 observations in each dataset, it is relatively difficult to see this from the numbers alone, and it would be almost impossible if one were looking at an SPSS or Excel spreadsheet with hundreds of observations. Furthermore, depending on the experimental hypothesis, each of these relationships would require a different statistical technique in order to properly analyze the data.

For these reasons, it's a good idea to always visualize your data as part of your analysis, not just as a last step in preparation for a poster or publication. It can help you to select the right analysis, and to avoid making poor statistical inferences!

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17-21.

The Anscombe Quartet in R


# Load required packages (ggplot2 for plotting,
# data.table for grouped summary statistics)
require(data.table)
require(ggplot2)


# Get the Anscombe Data
dt.anscombe <- data.table(Dataset=rep(c("Dataset 1", "Dataset 2", 
                                        "Dataset 3", "Dataset 4"),
                                      each=11),
                          X=unlist(anscombe[,1:4]),
                          y=unlist(anscombe[,5:8]),key="Dataset")


# Summary Stats
## Means
dt.anscombe[, lapply(.SD, mean), by=Dataset, .SDcols=c("X", "y")]
## Variances
dt.anscombe[, lapply(.SD, var), by=Dataset, .SDcols=c("X", "y")]


# Plot It
ggplot(data=dt.anscombe, aes(X, y)) + 
  geom_smooth(method='lm', fullrange=TRUE, 
              se=FALSE, color="steelblue") +
  geom_point(color="firebrick") + 
  facet_wrap(~Dataset) + 
  theme_bw() +
  labs(title="Anscombe's Quartet")
