Monday, January 27, 2014

Free Online Statistics and Methods Courses

Want to expand your statistics and methods knowledge at no cost -- except your time? Coursera hosts free online courses on a variety of research and analytic methods, among other topics. For example, Model Thinking, a 10-week course on how to think in terms of models (topics include, for example, cellular automata and Markov processes), will be offered beginning February 3rd. Statistical Analysis of fMRI Data, a 6-week course, will be offered beginning February 24th. If you don't see any topics that interest you at this time, you can create an account in advance and elect to receive emails when courses that might interest you are offered.

Monday, January 20, 2014

Open-Ended Questions in Survey Research

Controversy over the use of open-ended (or free-response) versus close-ended questions in surveys and interviews first erupted in the 1940s (Converse, 1984). While close-ended questions rose to dominance due in large part to the expense of processing and analyzing free-response data (Converse, 1984), interest in mixed methods in survey research has re-emerged in recent years. Driving this resurgence is the value open-ended questions can add to the interpretation of responses: including open-ended prompts in surveys populated with otherwise close-ended questions provides insight into the considerations, concerns, and thought processes of survey participants. Specifically, open-ended comments have been put forward as useful for understanding replies to close-ended questions (Driscoll, Appiah-Yeboah, Salib, & Rupert, 2007; Garcia et al., 2004), adding depth to the topics discussed in the survey (Garcia et al., 2004), identifying new research issues (Garcia et al., 2004), obtaining feedback on the research process (Garcia et al., 2004), and avoiding the bias that may result from suggesting responses to participants (Reja, Manfreda, Hlebec, & Vehovar, 2003).

Despite the many advantages of using open-ended comments in survey research, such questions introduce a unique set of challenges and concerns. Most notably, open-ended questions require extensive coding and are often associated with higher levels of non-response (Reja et al., 2003). How, then, can open-ended comments be employed most productively in research efforts? We provide an outline below of best practices in the use of open-ended questions as suggested by research in this area.

  1. Target open-ended question prompts toward specific topics. Responses to general prompts (e.g., “If you have any additional comments, please provide them below”) are more likely to vary in relevance and scope (Evans et al., 2005; Garcia et al., 2004) and may not provide the level of detail or kind of information desired.
  2. To increase response rate on open-ended questions, include targeted questions throughout the survey rather than only using a general open-ended question at the end of the survey. Open-ended questions asked at the end of a survey may elicit shorter answers than open-ended questions asked earlier in a survey (Galesic & Bosnjak, 2009).
  3. Assess response bias in open-ended comments. Certain people, including those with more education (Garcia et al., 2004), more interest in the survey topic (Geer, 1988; 1991), higher perceptions of survey value (Rogelberg, Fisher, Maynard, Hakel, & Horvath, 2001), and more negative experiences (Evans et al., 2005; Garcia et al., 2004; Poncheri, Lindberg, Thompson, & Surface, 2008), may be more likely to respond to open-ended questions. Consequently, responses to open-ended comments may not be representative of all respondents’ opinions.
  4. To reduce negativity bias in responses to open-ended questions – and potentially boost response rates – provide more detailed, motivating instructions in open-ended item stems (Smyth, Dillman, Christian, & McBride, 2009). Including explanations or instructions in the stem (e.g., emphasizing the importance of open-ended responses to the project) may improve open-ended item response length, elaboration on themes, and item response rate (Smyth et al., 2009). However, it is unclear whether these instructions would substantively affect the negativity of responses in addition to the response rate.
Using open-ended questions in survey research can illuminate responses to close-ended questions, providing researchers with richer data. To minimize the potential weaknesses of open-ended questions, the literature suggests that researchers should use targeted open-ended questions with motivating instructions throughout their surveys rather than only using a general open-ended question at the end of the survey. Additionally, researchers should assess how generalizable the content of open-ended responses is by comparing the characteristics and attitudes of participants who responded and did not respond to open-ended questions.
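The representativeness check described above can be sketched in a few lines of R. The example below uses simulated data, and every variable name (satisfaction, answered_open) is hypothetical; the idea is simply to compare respondents who did and did not answer an open-ended item on a close-ended variable the survey already collects:

```r
# Simulated survey: a close-ended satisfaction item (1-5) and an
# indicator for whether the respondent answered an open-ended item
set.seed(42)
n <- 300
survey <- data.frame(
  satisfaction  = sample(1:5, n, replace = TRUE),
  answered_open = rbinom(n, 1, 0.4)  # 1 = left an open-ended comment
)

# Proportion of the sample that answered the open-ended item
prop.table(table(survey$answered_open))

# Compare mean satisfaction for commenters vs. non-commenters; a large
# difference would suggest that open-ended responses are not
# representative of the full sample
t.test(satisfaction ~ answered_open, data = survey)
```

The same comparison can be repeated for demographics or any other close-ended item; with real data, a pattern like the negativity bias reported by Poncheri et al. (2008) would show up as commenters reporting systematically lower satisfaction.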

References
Converse, J. M. (1984). Strong arguments and weak evidence: The open/closed questioning controversy of the 1940’s. Public Opinion Quarterly, 48, 267-282.
Driscoll, D. L., Appiah-Yeboah, A., Salib, P., & Rupert, D. J. (2007). Merging qualitative and quantitative data in mixed methods research: How to and why not. Ecological and Environmental Anthropology, 3, 19-28.
Evans, J., Lambert, T. W., & Goldacre, M. J. (2005). Postal surveys of doctors’ careers: Who writes comments and what do they write about? Quality & Quantity, 39, 217-239.
Galesic, M. & Bosnjak, M. (2009). Effects of questionnaire length on participation and indicators of response quality in a web survey. Public Opinion Quarterly, 73, 349-360.
Garcia, J., Evans, J., & Redshaw, M. (2004). “Is there anything else you would like to tell us” – Methodological issues in the use of free-text comments from postal surveys. Quality & Quantity, 38, 113-125.
Geer, J. G. (1988). What do open-ended questions measure? Public Opinion Quarterly, 52, 365-371.
Geer, J. G. (1991). Do open-ended questions measure “salient” issues? Public Opinion Quarterly, 55, 360-370.
Poncheri, R. M., Lindberg, J. T., Thompson, L. F., & Surface, E. A. (2008). A comment on employee surveys: Negativity bias in open-ended responses. Organizational Research Methods, 11, 614-630.
Reja, U., Manfreda, K. L., Hlebec, V., & Vehovar, V. (2003). Open-ended vs. close-ended questions in web questionnaires. Developments in Applied Statistics, 19, 159-177.
Rogelberg, S. G., Fisher, G. G., Maynard, D. C., Hakel, M. D., & Horvath, M. (2001). Attitudes toward surveys: Development of a measure and its relationship to respondent behavior. Organizational Research Methods, 4, 3-25.
Smyth, J. D., Dillman, D. A., Christian, L. M., & McBride, M. (2009). Open-ended questions in web surveys: Can increasing the size of answer boxes and providing extra verbal instructions improve response quality? Public Opinion Quarterly, 73, 325-337.

Monday, January 13, 2014

Aligning Your Research

Published manuscripts separate research into distinct subsections: introduction, method, results, and discussion. This separation -- and the order of these subsections -- implies that each truly is distinct. In other words, first you develop theory, then determine the best method, then figure out how to analyze your data, and finally draw your conclusions. While it is commonly accepted that research should often be theory-driven (or at least replicated, if empirically driven!), the best research is developed by integrating theory, methods, and analysis rather than considering each separately.

Two common issues arise when theory, methods, and analysis are determined separately. First, researchers sometimes collect data that doesn't truly test their hypotheses of interest. Second, and more commonly, the data collected are not adequate for a desired analysis. For example, a researcher might collect data from individuals in five locations for a multilevel model, then learn afterwards that she doesn't have data from enough locations to estimate the effect of location characteristics on individual-level outcomes. Fortunately, these issues are fairly readily avoided if researchers abide by the following:

1. When developing any part of your project, keep the others in mind. While developing theory, think about how you might collect data and test it. While developing your method, think about your theory - to ensure you are collecting data that allows you to test your theory - and your analysis technique. Considering your analysis technique while developing your method will help prevent you from collecting insufficient or inadequate data for rigorous testing. Finally, when determining your analysis technique, make sure it will actually let you test your theory and that it fits the method you are developing. The most important thing to keep in mind throughout all these stages is that methods and analysis are not separate from theory; they are part of your theory.

2. Have a plan - or a co-author! At times, you will want to test theory that requires an analysis you've never used before, or a method you are unfamiliar with. When this happens, you will want to either learn the analysis/method in advance or line up a co-author who knows it. Ignorance, in this case, will lead to flawed data collection - a waste of your time and the participants'. As long as you or a co-author knows the method/analysis in question, you can design your study optimally.

3. Keep it simple. It's hard to resist sexy techniques. Many researchers hear of a new technique - usually one they haven't used before (see 2 above!) - and decide they need to implement it in their research. Sometimes, these 'hot' techniques are exactly what you need to test your theory. Often, however, a simple t-test, ANOVA, or regression will suffice. Ultimately, the best technique is the simplest possible technique that lets you test your theory rigorously.

Are there other techniques you recommend to ensure that theory, methods, and analysis are aligned?

Monday, January 6, 2014

Why Plot Your Data as Part of Your Analysis?

Suppose you wanted to compare the following datasets from four independent experiments, where 'X' is the continuous independent variable and 'y' is the continuous dependent variable:

Dataset 1
obs     X       y
 1    10.00    8.04
 2     8.00    6.95
 3    13.00    7.58
 4     9.00    8.81
 5    11.00    8.33
 6    14.00    9.96
 7     6.00    7.24
 8     4.00    4.26
 9    12.00   10.84
10     7.00    4.82
11     5.00    5.68

Dataset 2
obs     X       y
 1    10.00    9.14
 2     8.00    8.14
 3    13.00    8.74
 4     9.00    8.77
 5    11.00    9.26
 6    14.00    8.10
 7     6.00    6.13
 8     4.00    3.10
 9    12.00    9.13
10     7.00    7.26
11     5.00    4.74

Dataset 3
obs     X       y
 1    10.00    7.46
 2     8.00    6.77
 3    13.00   12.74
 4     9.00    7.11
 5    11.00    7.81
 6    14.00    8.84
 7     6.00    6.08
 8     4.00    5.39
 9    12.00    8.15
10     7.00    6.42
11     5.00    5.73

Dataset 4
obs     X       y
 1     8.00    6.58
 2     8.00    5.76
 3     8.00    7.71
 4     8.00    8.84
 5     8.00    8.47
 6     8.00    7.04
 7     8.00    5.25
 8    19.00   12.50
 9     8.00    5.56
10     8.00    7.91
11     8.00    6.89

After a quick look at the numbers, the datasets seem relatively similar, so you decide to compare their means and variances and you obtain the following:

Dataset   Mean.X   Variance.X   Mean.y   Variance.y
1         9.00     11.00        7.50     4.13
2         9.00     11.00        7.50     4.13
3         9.00     11.00        7.50     4.12
4         9.00     11.00        7.50     4.12

Now the data seem very similar. But what about the effect of the independent variable? To find out, you compare simple regressions of y on X for each of the datasets:

Statistical models
             Dataset 1   Dataset 2   Dataset 3   Dataset 4
(Intercept)  3.00*       3.00*       3.00*       3.00*
             (1.12)      (1.13)      (1.12)      (1.12)
X            0.50**      0.50**      0.50**      0.50**
             (0.12)      (0.12)      (0.12)      (0.12)
R2           0.67        0.67        0.67        0.67
Adj. R2      0.63        0.63        0.63        0.63
Num. obs.    11          11          11          11
***p < 0.001, **p < 0.01, *p < 0.05

The means, variances, regression coefficients, and effect sizes are nearly identical! These data must be nearly identical... except they're not: the relationship between X and y is very different in each set. While the relationship between the variables in Dataset 1 is roughly linear, the relationship in Dataset 2 is curvilinear. Datasets 3 and 4 each contain a single extreme outlier, though of different kinds: in Dataset 3, one point falls far from an otherwise tight linear trend, while in Dataset 4, X is constant except for a single observation.

Anscombe's Quartet

Francis Anscombe created these data for a 1973 paper to demonstrate the importance of data visualization. Though the datasets have nearly identical statistical properties, they in fact represent four very different relationships between X and y. Even with only 11 observations in each dataset, it is relatively difficult to see this just by looking at the numbers, and it would be almost impossible in an SPSS or Excel spreadsheet with hundreds of observations. Furthermore, depending on the experimental hypothesis, each of these relationships would require a different statistical technique to properly analyze the data.

For these reasons, it's a good idea to always visualize your data as part of your analysis, not just as a last step in preparation for a poster or publication. It can help you to select the right analysis, and to avoid making poor statistical inferences!

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17-21.

The Anscombe Quartet in R

require(ggplot2)
require(data.table)

# Reshape the built-in anscombe data into long format:
# one row per observation, keyed by dataset
dt.anscombe <- data.table(Dataset=rep(c("Dataset 1", "Dataset 2",
                                        "Dataset 3", "Dataset 4"),
                                      each=11),
                          X=unlist(anscombe[,1:4]),
                          y=unlist(anscombe[,5:8]), key="Dataset")

# Summary Stats
## Means
dt.anscombe[,lapply(.SD[,list(X,y)],mean),by=Dataset]
## Variances
dt.anscombe[,lapply(.SD[,list(X,y)],var),by=Dataset]

# Plot It
ggplot(data=dt.anscombe, aes(X,y)) +
  geom_smooth(method='lm', fullrange=TRUE,
              se=FALSE, color="steelblue") +
  geom_point(color="firebrick") +
  facet_wrap(~Dataset) +
  theme_bw() +
  labs(title="Anscombe's Quartet")
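The regression comparison shown earlier can also be reproduced directly from R's built-in anscombe data frame, without any additional packages. This is a minimal sketch of the idea, not the code used to typeset the table above:

```r
# Fit y ~ x separately within each of the four Anscombe datasets
fits <- lapply(1:4, function(i)
  lm(reformulate(paste0("x", i), response = paste0("y", i)),
     data = anscombe))

# Intercepts and slopes are nearly identical across datasets
# (roughly 3.00 and 0.50 in each column)
round(sapply(fits, coef), 2)

# As are the R-squared values (roughly 0.67 in each dataset)
round(sapply(fits, function(f) summary(f)$r.squared), 2)
```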

Wednesday, January 1, 2014

New statistical analysis software announcement - "xxM"

A new stats package is now available (for free), which may be especially useful to anyone who works with multi-level structural equation modeling. "xxM implements a modeling framework called n-Level Structural Equation Modeling (NL-SEM) and can estimate models with any number of levels.  Observed and latent variables are allowed at all levels."

The xxM package runs within R (or RStudio).

See more information here: http://xxm.times.uh.edu/ and get started here: http://xxm.times.uh.edu/get-started/

Happy new year everyone!