Personnel/Employees
Program Manager - 1
Assistant Program Manager (Graduate Student) - 3
Undergraduate Tutors - 55
http://www.rio.edu/human-resources/documents/OH-Minimum_WagePoster2017.pdf
Min wage = $8.15 + $3 for gas = $11.15 per hour
Transportation (gas or bus passes for tutors to get to the schools)
Part-time assistant program managers ($15/hr @ 20 hours per week)
= $15/hr × 20 hr/week × 52 weeks/year = $15,600 per assistant program manager; × 3 assistant program managers = $46,800 assistant program manager pay per year
$10/hr × 5 hr/week × 30 weeks/year × 55 tutors = $82,500 tutor pay allocation per year
Equipment
For this program, it is currently estimated that there will be a 267-student enrollment drawn from across the Columbus City Schools. This estimate comes from the 12.8% of students identified as gifted: halving that rate gives 6.4%, which for roughly 50,000 students is 3,200 gifted students; dividing by 12 grade levels yields approximately 267 students per grade. Each student will be given an Inspiron 3000 11" 2-in-1 laptop to use throughout the semester. These laptops will be purchased from the TechHub at OSU at a cost of $560 per laptop (TechHub 2017). Additionally, a 4-year protection plan costing $220 per laptop will be purchased through the TechHub to cover any damage that may occur to the laptops throughout the program (TechHub 2017). At $780 per student, this totals $208,260 for 267 students. This will be a one-time purchase because the laptops are expected to be reused throughout the entire duration of the program. Additional laptops can be bought if student enrollment increases and the budget allows.
Lastly, equipment will need to be purchased for the employees of this program. Each tutor, assistant program manager, and the program manager will receive a Surface Laptop from the TechHub. These laptops were chosen because they have multiple features that make them well suited for collaboration. Each laptop is priced at $999.99, with each receiving a $220 protection plan (TechHub 2017). With an estimated 55 tutors, 3 assistant program managers, and 1 program manager (59 employees), this equates to $71,979.41. This equipment is a one-time purchase because the laptops will belong to the program and will be used for program purposes only.
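The totals above are straightforward to audit with a short script. This is a minimal sketch using only the counts and unit prices stated in this section; all figures are taken from the text as assumptions, not independently verified.

```python
# Rough budget check using the figures stated above (assumed, not authoritative).
students = 267                    # estimated gifted students served (one grade level)
student_laptop = 560 + 220        # Inspiron 3000 2-in-1 plus 4-year protection plan
staff = 55 + 3 + 1                # tutors + assistant program managers + program manager
staff_laptop = 999.99 + 220       # Surface Laptop plus protection plan

tutor_pay = 10 * 5 * 30 * 55      # $10/hr x 5 hr/week x 30 weeks x 55 tutors
apm_pay = 15 * 20 * 52 * 3        # $15/hr x 20 hr/week x 52 weeks x 3 assistant PMs

print(f"Student laptops:  ${students * student_laptop:,.2f}")   # $208,260.00
print(f"Staff laptops:    ${staff * staff_laptop:,.2f}")         # $71,979.41
print(f"Tutor pay:        ${tutor_pay:,.2f}")                    # $82,500.00
print(f"Assistant PM pay: ${apm_pay:,.2f}")                      # $46,800.00
```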
Unit 5 - Correlation: Theory and Logic
INTRODUCTION
The first inferential statistic we will focus on is correlation, denoted r, which estimates the strength of a linear association between two variables.
Interpreting Correlation: Magnitude and Sign
Interpreting a Pearson's correlation coefficient (rXY) requires an understanding of two concepts:
1. Magnitude.
2. Sign (±).
The magnitude refers to the strength of the linear relationship between variable X and variable Y. The rXY ranges in value from −1.00 to +1.00. To determine magnitude, ignore the sign of the correlation; the absolute value of rXY indicates the extent to which variable X and variable Y are linearly related. For correlations close to 0, there is no linear relationship. As the correlation approaches either −1.00 or +1.00, the magnitude of the correlation increases. Therefore, for example, |−.65| > |+.25|; that is, the magnitude of r = −.65 is greater than the magnitude of r = +.25.
In contrast to magnitude, the sign of a non-zero correlation is either negative or positive. These labels are not interpreted as "bad" or "good." Instead, the sign represents the slope of the linear relationship between X and Y. A scatter plot is used to visualize the slope of this linear relationship, and it is a two-dimensional graph with dots representing the combined X, Y score. Interpreting scatter plots is necessary to check the assumptions of correlation discussed below.
A positive correlation indicates that as values of X increase, the values of Y also increase (for example, height and weight). Figures 7.1 to 7.4 (pp. 262–263) in the Warner text illustrate positive correlations showing that as values of X increase on the horizontal axis, values of Y increase on the vertical axis. The magnitude of the correlations ranges from a perfect positive linear relationship of r = +1.00 in Figure 7.1 to a weak positive correlation of r = +.23 in Figure 7.4. Conversely, a negative correlation indicates that as values of X increase, the values of Y decrease (see Figure 7.6 on p. 265 representing r = −.75). Finally, when X and Y values are randomly distributed on the scatter plot (that is, there is no linear relationship), then r = 0.00 (see Figure 7.5 on p. 264).
Assumptions of Correlation
All inferential statistics, including correlation, operate under assumptions that are checked prior to interpreting analyses. Violations of assumptions can lead to erroneous inferences regarding a null hypothesis. The first assumption of correlation is independence of observations for X and Y scores. The measurement of individual X and Y scores should not be influenced by errors in measurement or problems in research design. (For example, a student completing an IQ test should not be looking over the shoulder of another student taking that test; his or her IQ score should be independent.) This first assumption of correlation is not statistical in nature; it is controlled by using reliable and valid instruments and by maintaining proper research procedures to maintain independence of observations.
The second assumption is that, for Pearson's r, X and Y are quantitative and each variable is normally distributed. Other correlations discussed below do not require this assumption, but Pearson's r is the most widely used and reported type of correlation. It is therefore important to check this assumption when calculating Pearson's r in SPSS. This assumption is checked by a visual inspection of X and Y histograms and calculations of skew and kurtosis values.
The third assumption of correlation is that X and Y scores are linearly related. Correlation does not detect strong curvilinear relationships as shown in Figure 7.7 of the Warner text (p. 268). This assumption is checked by a visual inspection of the X, Y scatter plot.
The fourth assumption of correlation is that the X and Y scores should not have extreme bivariate outliers that influence the magnitude of the correlation. Bivariate outliers are also detected by a visual examination of a scatter plot. Figures 7.10 and 7.11 in the Warner text (pp. 272–273) illustrate how outliers can dramatically influence the magnitude of the correlation, which sometimes leads to errors in null hypothesis testing. Bivariate outliers are particularly problematic when a sample size is small, and Warner (2013) suggests an N of at least 100 for studies that report correlations.
The fifth assumption of correlation is that the variability in Y scores is uniform across levels of X. This requirement is referred to as the homogeneity of variance assumption, which is usually difficult to assess in scatter plots with a small sample size. Sometimes a potential violation can be detected, such as in Figure 4.43 of the Warner text (p. 169), but this assumption is typically emphasized when checking the homogeneity of variance for a t test or analysis of variance (ANOVA) studied later in the course.
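These checks are done in SPSS in this course. Purely as an illustration, the following Python sketch (with made-up data standing in for X and Y) runs the same kind of screening: skew and kurtosis values plus histograms and a scatter plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 120)            # placeholder X scores; substitute real data
y = 0.6 * x + rng.normal(0, 8, 120)    # placeholder Y scores; substitute real data

# Normality screening: skew and kurtosis near 0 suggest roughly normal shapes.
for name, scores in (("X", x), ("Y", y)):
    print(name, "skew =", round(stats.skew(scores), 2),
          "kurtosis =", round(stats.kurtosis(scores), 2))

# Visual screening: histograms for each variable and an X, Y scatter plot
# (look for linearity, bivariate outliers, and roughly uniform spread of Y across X).
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=15)
axes[0].set_title("Histogram of X")
axes[1].hist(y, bins=15)
axes[1].set_title("Histogram of Y")
axes[2].scatter(x, y, s=12)
axes[2].set_title("X, Y scatter plot")
plt.tight_layout()
plt.show()
```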
Hypothesis Testing of Correlation
The null hypothesis for correlation predicts no significant linear relationship between X and Y, or H0: rXY = 0. A directional alternative hypothesis for correlation is either an expected significant positive relationship (H1: rXY > 0) or a significant negative relationship (H1: rXY < 0). A non-directional alternative hypothesis would simply predict that the correlation is significantly different from 0, but it does not stipulate the sign of the relationship (H1: rXY ≠ 0).
For correlation as well as t tests and ANOVA studied later in the course, the standard alpha level for rejecting the null hypothesis is set to .05. SPSS output for a correlation showing a p value of less than .05 indicates that the null hypothesis should be rejected; there is a significant relationship between X and Y. A p value greater than .05 indicates that the null hypothesis should not be rejected; there is not a significant relationship between X and Y.
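For readers working outside SPSS, the same decision rule can be illustrated in a few lines of Python. This is a sketch with made-up data; scipy's pearsonr reports the two-tailed p value.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.4 * x + rng.normal(size=100)     # built-in linear relationship, for illustration

r, p = pearsonr(x, y)                  # Pearson's r and its two-tailed p value
alpha = 0.05
decision = "reject H0" if p < alpha else "do not reject H0"
print(f"r = {r:.3f}, p = {p:.4f} -> {decision}")
```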
Effect Size in Correlation
Even if the null hypothesis is rejected, how large is the association between X and Y? To provide additional context, the interpretation of all inferential statistics, including correlation, should include an estimate of effect size. An effect size is articulated along a continuum from "small" to "medium" to "large." Effect sizes allow the researcher to properly interpret the data. For a Pearson's correlation, effect size is expressed by either r or r2. The proportion of variance shared between the two variables is expressed by squaring the correlation coefficient (r2). This is called the coefficient of determination and is a measure of the amount of variation explained by the association. For example, if the correlation between X and Y is r = .40, then r2 = .16, meaning that approximately 16% of the variance in Y (the dependent variable) is explained by X (the independent variable).
Page 298 in the Warner text provides guidelines on the interpretation of r and r2. Roughly speaking, a correlation less than or equal to .10 (r2 = .01) is "small," a correlation of .30 (r2 = .09) is "medium," and a correlation above .50 (r2 = .25) is "large." It is important to interpret correlation with this in mind, as it is possible to have a significant correlation (because statistical significance is partially dependent on sample size) and still have a small effect size (which is calculated independently of sample size).
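A tiny Python sketch (illustrative values only; the cutoffs are conventions from the guidelines above, not strict rules) shows how r2 and the size labels follow from r.

```python
def effect_size_label(r: float) -> str:
    """One reasonable binning of the guidelines above (cutoffs are conventions)."""
    a = abs(r)
    if a > 0.50:
        return "large"
    if a >= 0.30:
        return "medium"
    return "small"

for r in (0.10, 0.30, 0.40, -0.65):
    print(f"r = {r:+.2f}  r^2 = {r * r:.2f}  ({effect_size_label(r)})")
```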
Alternative Correlation Coefficients
Chapter 7 of the Warner text focuses on the most widely used correlation, referred to as Pearson's r. Pearson's r is calculated between X and Y variables that are measured on either the interval or ratio scale of measurement (for example, height and weight). Chapter 8 of the Warner text reviews other types of correlation that depend on other scales of measurement for X and Y. A point biserial (rpb) correlation is calculated when one variable is dichotomous (such as gender) and the other variable is interval/ratio data (such as weight). If both variables are ranked (ordinal) data, the correlation is referred to as Spearman's r (rs). Although the underlying scales of measurement differ from the standard Pearson's r, rpb and rs are both calculated between −1.00 and +1.00 and are interpreted similarly.
If both variables are dichotomous, the correlation is referred to as phi (φ). A final test of association is referred to as chi-square. Phi and chi-square are studied in advanced inferential statistics.
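Purely as an illustration outside SPSS, the Python sketch below (made-up data) computes the alternative coefficients named here. It relies on the fact that phi is simply Pearson's r applied to two dichotomous variables, and on scipy's spearmanr and pointbiserialr functions.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, pointbiserialr

rng = np.random.default_rng(2)
weight = rng.normal(170, 25, 200)                       # interval/ratio variable
group = rng.integers(0, 2, 200)                         # true dichotomy, coded 0/1
rank_x = rng.normal(size=200)
rank_y = rank_x + rng.normal(size=200)

r_pb, _ = pointbiserialr(group, weight)                 # dichotomous with interval/ratio
r_s, _ = spearmanr(rank_x, rank_y)                      # both variables treated as ranks
other_group = (rng.random(200) < 0.4).astype(int)
phi, _ = pearsonr(group, other_group)                   # Pearson's r on two dichotomies = phi

print(f"point biserial = {r_pb:.3f}, Spearman = {r_s:.3f}, phi = {phi:.3f}")
```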
Reference
Warner, R. M. (2013). Applied statistics: From bivariate through multivariate techniques (2nd ed.). Thousand Oaks, CA: Sage.
OBJECTIVES
To successfully complete this learning unit, you will be expected to:
1. Analyze the interpretation of correlation coefficients.
2. Identify the assumptions of correlation.
3. Identify null hypothesis testing of correlation.
4. Interpret a correlation reported in the scientific literature.
5. Analyze the assumptions of correlation.
Unit 5 Study 1
Use your Warner text, Applied Statistics: From Bivariate Through Multivariate Techniques, to complete the following:
• Read Chapter 7, "Bivariate Pearson Correlation," pages 261–314. This chapter addresses the following topics:
◦ Assumptions of Pearson's r.
◦ Preliminary data screening for Pearson's r.
◦ Statistical significance tests for Pearson's r.
◦ Factors influencing the magnitude and sign of Pearson's r.
◦ Effect-size indexes.
◦ Interpretation of Pearson's r values.
• Read Chapter 8, "Alternative Correlation Coefficients," pages 315–343. This chapter addresses the following topics:
◦ Correlations for rank or ordinal scores.
◦ Correlations for true dichotomies.
◦ Correlations for artificial dichotomies.
◦ Chi-square test of association.
Readings
Walk, M., & Rupp, A. (2010). Pearson product-moment correlation coefficient. In N. J. Salkind (Ed.), Encyclopedia of research design (pp. 1023–1026). Thousand Oaks, CA: Sage. doi:10.4135/9781412961288.n309
Resources
IBM SPSS Step-by-Step Guide: Correlations.
Unit 5 Discussion 1
Correlation Versus Causation
• If correlation does not imply causation, what does it imply?
• Are there ever any circumstances when a correlation can be interpreted as evidence for a causal connection between two variables?
• If yes, what circumstances?
Chapter 7 - BIVARIATE PEARSON CORRELATION
7.1 Research Situations Where Pearson’s r Is Used
Pearson’s r is typically used to describe the strength of the linear relationship between two quantitative variables. Often, these two variables are designated X (predictor) and Y (outcome). Pearson’s r has values that range from −1.00 to +1.00. The sign of r provides information about the direction of the relationship between X and Y. A positive correlation indicates that as scores on X increase, scores on Y also tend to increase; a negative correlation indicates that as scores on X increase, scores on Y tend to decrease; and a correlation near 0 indicates that as scores on X increase, scores on Y neither increase nor decrease in a linear manner. As an example, consider the hypothetical data in Figure 7.1 . Suppose that a time-share sales agency pays each employee a base wage of $10,000 per year and, in addition, a commission of $1,500 for each sale that the employee completes. An employee who makes zero sales would earn $10,000; an employee who sold four time-shares would make $10,000 + $1,500 × 4 = $16,000. In other words, for each one-unit increase in the number of timeshares sold (X), there is a $1,500 increase in wages. Figure 7.1 illustrates a perfect linear relationship between number of units sold (X1) and wages in dollars (Y1).
The absolute magnitude of Pearson’s r provides information about the strength of the linear association between scores on X and Y. For values of r close to 0, there is no linear association between X and Y. When r = +1.00, there is a perfect positive linear association ; when r = −1.00, there is a perfect negative linear association . Intermediate values of r correspond to intermediate strength of the relationship. Figures 7.2 through 7.5 show examples of data for which the correlations are r = +.75, r = +.50, r = +.23, and r = .00.
Pearson’s r is a standardized or unit-free index of the strength of the linear relationship between two variables. No matter what units are used to express the scores on the X and Y variables, the possible values of Pearson’s r range from –1 (a perfect negative linear relationship) to +1 (a perfect positive linear relationship). Consider, for example, a correlation between height and weight. Height could be measured in inches, centimeters, or feet; weight could be measured in ounces, pounds, or kilograms. When we correlate scores on height and weight for a given sample of people, the correlation has the same value no matter which of these units are used to measure height and weight. This happens because the scores on X and Y are converted to z scores (i.e., they are converted to unit-free or standardized distances from their means) during the computation of Pearson’s r.
Figure 7.1 Scatter Plot for a Perfect Linear Relationship, r = +1.00 (Y1 = 10,000 + 1,500 × X1; e.g., for X1 = 4, Y1 = 16,000)
Figure 7.2 Scatter Plot for Correlation r = +.75
Figure 7.3 Scatter Plot for Correlation r = +.50
Figure 7.4 Scatter Plot for Correlation r = +.23
Figure 7.5 Scatter Plot for Unrelated Variables With Correlation r = .00
For X and Y to have a perfect correlation of +1, all the X, Y points must lie on a straight line, as shown in Figure 7.1 . Perfect linear relations are rarely seen in real data. When the relationship is perfectly linear, we can make an exact statement about how values of Y change as values of X increase; for Figure 7.1 , we can say that for a 1-unit increase in X1, there is exactly a 1,500-unit increase in Y1. As the strength of the relationship weakens (e.g., r = +.75), we can make only approximate statements about how Y changes for each 1-unit increase in X. In Figure 7.2 , with r = +.75, we can make the (less precise) statement that the mean value of Y2 tends to increase as the value of X2 increases. For example, for relatively low X2 scores (between 30 and 60), the mean score on Y2 is about 15. For relatively high X2 scores (between 80 and 100), the mean score of the Y2 scores is approximately 45. When the correlation is less than 1 in absolute value, we can no longer predict Y scores perfectly from X, but we can predict that the mean of Y will be different for different values of X. In the scatter plot for r = +.75, the points form an elliptical cluster (rather than a straight line). If you look at an individual value of X, you can see that there is a distribution of several different Y values for each X. When we do a correlation analysis, we assume that the amount of change in the Y mean is consistent as we move from X = 1 to X = 2 to X = 3, and so forth; in other words, we assume that X and Y are linearly related.
As r becomes smaller, it becomes difficult to judge whether there is any linear relationship simply from visual examination of the scatter plot. The data in Figure 7.4 illustrate a weak positive correlation (r = +.23). In this graph, it is difficult to see an increase in mean values of Y4 as X4 increases because the changes in the mean of Y across values of X are so small.
Figure 7.6 Scatter Plot for Negative Correlation, r = −.75
Figure 7.5 shows one type of scatter plot for which the correlation is 0; there is no tendency for Y5 scores to be larger at higher values of X5. In this example, scores on X are not related to scores on Y, linearly or nonlinearly. Whether X5 is between 16 and 18, or 18 and 20, or 22 and 24, the mean value of Y is the same (approximately 10). On the other hand, Figure 7.6 illustrates a strong negative linear relationship (r = −.75); in this example, as scores on X6 increase, scores on Y6 tend to decrease.
Pearson correlation is often applied to data collected in nonexperimental studies; because of this, textbooks often remind students that “correlation does not imply causation.” However, it is possible to apply Pearson’s r to data collected in experimental situations. For example, in a psychophysics experiment, a researcher might manipulate a quantitative independent variable (such as the weight of physical objects) and measure a quantitative dependent variable (such as the perceived heaviness of the objects). After scores on both variables are transformed (using power or log transformations), a correlation is calculated to show how perceived heaviness is related to the actual physical weight of the objects; in this example, where the weights of objects are varied by the researcher under controlled conditions, it is possible to make a causal inference based on a large Pearson’s r. The ability to make a causal inference is determined by the nature of the research design, not by the statistic that happens to be used to describe the strength and nature of the relationship between the variables. When data are collected in the context of a carefully controlled experiment, as in the psychophysical research example, a causal inference may be appropriate. However, in many situations where Pearson’s r is reported, the data come from nonexperimental or correlational research designs, and in those situations, causal inferences from correlation coefficients are not warranted.
Despite the inability of nonexperimental studies to provide evidence for making causal inferences, nonexperimental researchers often are interested in the possible existence of causal connections between variables. They often choose particular variables as predictors in correlation analysis because they believe that they might be causal. The presence or absence of a significant statistical association does provide some information: Unless there is a statistical association of some sort between scores on X and Y, it is not plausible to think that these variables are causally related. In other words, the existence of some systematic association between scores on X and Y is a necessary (but not sufficient) condition for making the inference that there is a causal association between X and Y. Significant correlations in nonexperimental research are usually reported merely descriptively, but sometimes the researchers want to show that correlations exist so that they can say that the patterns in their data are consistent with the possible existence of causal connections. It is important, of course, to avoid causal-sounding terminology when the evidence is not strong enough to warrant causal inferences, and so researchers usually limit themselves to saying things such as “X predicted Y” or “X was correlated with Y” when they report data from nonexperimental research.
In some nonexperimental research situations, it makes sense to designate one variable as the predictor and the other variable as the outcome. If scores on X correspond to events earlier in time than scores on Y or if there is reason to think that X might cause Y, then researchers typically use the scores on X as predictors. For example, suppose that X is an assessment of mother/infant attachment made when each participant is a few months old, and Y is an assessment of adult attachment style made when each participant is 18 years old. It would make sense to predict adult attachment at age 18 from infant attachment; it would not make much sense to predict infant attachment from adult attachment. In many nonexperimental studies, the X and Y variables are both assessed at the same point in time, and it is unclear whether X might cause Y, Y might cause X, or whether both X and Y might be causally influenced by other variables. For example, suppose a researcher measures grade point average (GPA) and self-esteem for a group of first-year university students. There is no clear justification for designating one of these variables as a predictor; the choice of which variable to designate as the X or predictor variable in this situation is arbitrary.
7.2 Hypothetical Research Example
As a specific example of a question that can be addressed by looking at a Pearson correlation, consider some survey data collected from 118 university students about their heterosexual dating relationships. The variables in this dataset are described in Table 7.1 ; the scores are in a dataset named love.sav. Only students who were currently involved in a serious dating relationship were included. They provided several kinds of information, including their own gender, partner gender, and a single-item rating of attachment style. They also filled out Sternberg’s Triangular Love Scale (Sternberg, 1997). Based on answers to several questions, total scores were calculated for the degree of intimacy, commitment, and passion felt toward the current relationship partner.
Table 7.1 Description of “Love” Dataset in the File Named love.sav
NOTE: N = 118 college student participants (88 female, 30 male).
Later in the chapter, we will use Pearson’s r to describe the strength of the linear relationship among pairs of these variables and to test whether these correlations are statistically significant. For example, we can ask whether there is a strong positive correlation between scores on intimacy and commitment, as well as between passion and intimacy.
7.3 Assumptions for Pearson’s r
The assumptions that need to be met for Pearson’s r to be an appropriate statistic to describe the relationship between a pair of variables are as follows:
1. Each score on X should be independent of other X scores (and each score on Y should be independent of other Y scores). For further discussion of the assumption of independence among observations and the data collection methods that tend to create problems with this assumption, see Chapter 4 .
2. Scores on both X and Y should be quantitative and normally distributed. Some researchers would state this assumption in an even stronger form: Adherents to strict measurement theory would also require scores on X and Y to be interval/ratio level of measurement. In practice, Pearson’s r is often applied to data that are not interval/ratio level of measurement; for example, the differences between the scores on 5-point rating scales of attitudes probably do not represent exactly equal differences in attitude strength, but it is common practice for researchers to apply Pearson’s r to this type of variable. Harris (2001) summarized arguments about this issue and concluded that it is more important that scores be approximately normally distributed than that the variables satisfy the requirement of true equal interval level of measurement. This does not mean that we should completely ignore issues of level of measurement (see Chapter 1 for further comment on this controversial issue). We can often obtain useful information by applying Pearson’s r even when the data are obtained by using measurement methods that may fall short of the requirements for true equal interval differences between scores; for example, it is common to apply Pearson correlation to scores obtained using 5-point Likert-type rating scales. Pearson’s r can also be applied to data where X or Y (or both) are true dichotomous variables—that is, categorical variables with just two possible values; in this case, it is called a phi coefficient (Φ) . The phi coefficient and other alternate forms of correlation for dichotomous variables are discussed in Chapter 8 .
3. Scores on Y should be linearly related to scores on X. Pearson’s r does not effectively detect curvilinear or nonlinear relationships. An example of a curvilinear relationship between X and Y variables that would not be well described by Pearson’s r appears in Figure 7.7 .
4. X, Y scores should have a bivariate normal distribution. Three-dimensional representations of the bivariate normal distribution were shown in Figures 4.40 and 4.41 , and the appearance of a bivariate normal distribution in a two-dimensional X, Y scatter plot appears in Figure 7.8 . For each value of X, values of Y should be approximately normally distributed. This assumption also implies that there should not be extreme bivariate outliers. Detection of bivariate outliers is discussed in the next section (on preliminary data-screening methods for correlation).
5. Scores on Y should have roughly equal or homogeneous variance across levels of X (and vice versa). Figure 7.9 is an example of data that violate this assumption; the variance of the Y scores tends to be low for small values of X (on the left-hand side of the scatter plot) and high for large values of X (on the right-hand side of the scatter plot).
Figure 7.7 Scatter Plot for Strong Curvilinear Relationship (for These Data, r = .02)
Figure 7.8 Scatter Plot That Shows a Bivariate Normal Distribution for X and Y
SOURCE: www.survo.fi/gallery/010.html
Figure 7.9 Scatter Plot With Heteroscedastic Variance
7.4 Preliminary Data Screening
General guidelines for preliminary data screening were given in Chapter 4 . To assess whether the distributions of scores on X and Y are nearly normal, the researcher can examine a histogram of the scores for each variable. As described in Chapter 4 , most researchers rely on informal visual examination of the distributions to judge normality.
The researcher also needs to examine a bivariate scatter plot of scores on X and Y to assess whether the scores are linearly related, whether the variance of Y scores is roughly uniform across levels of X, and whether there are bivariate outliers. A bivariate outlier is a score that is an unusual combination of X, Y values; it need not be extreme on either X or Y, but in the scatter plot, it lies outside the region where most of the other X, Y points are located. Pearson’s r can be an inaccurate description of the strength of the relationship between X and Y when there are one or several bivariate outliers. As discussed in Chapter 4 , researchers should take note of outliers and make thoughtful decisions about whether to retain, modify, or remove them from the data. Figure 7.10 shows an example of a set of N = 50 data points; when the extreme bivariate outlier is included (as in the upper panel), the correlation between X and Y is +.64; when the correlation is recalculated with this outlier removed (as shown in the scatter plot in the lower panel), the correlation changes to r = −.10. Figure 7.11 shows data for which a single bivariate outlier deflates the value of Pearson’s r; when the circled data point is included, r = +.53; when it is omitted, r = +.86. It is not desirable to have the outcome of a study depend on the behavior represented by a single data point; the existence of this outlier makes it difficult to evaluate the relationship between the X and Y variables. It would be misleading to report a correlation of r = +.64 for the data that appear in Figure 7.10 without including the information that this large positive correlation would be substantially reduced if one bivariate outlier was omitted. In some cases, it may be more appropriate to report the correlation with the outlier omitted.
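The sensitivity to a single bivariate outlier is easy to reproduce. The sketch below uses made-up scores (not the Figure 7.10 data) and recomputes r with and without one extreme point.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
x = rng.normal(50, 5, 49)
y = rng.normal(50, 5, 49)                    # unrelated to x by construction

r_without, _ = pearsonr(x, y)

# Append one extreme bivariate outlier (high on both X and Y).
x_out = np.append(x, 95.0)
y_out = np.append(y, 95.0)
r_with, _ = pearsonr(x_out, y_out)

print(f"r without the outlier = {r_without:+.2f}")
print(f"r with one outlier    = {r_with:+.2f}")   # typically much larger and positive
```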
It is important to examine a scatter plot of the X, Y scores when interpreting a value of r. A scatter plot makes it possible to assess whether violations of assumptions of r make the Pearson’s r value a poor index of relationship; for instance, the scatter plot can reveal a nonlinear relation between X and Y or extreme outliers that have a disproportionate impact on the obtained value of r. When Pearson correlation is close to zero, it can mean that there is no relationship between X and Y, but correlations close to zero can also occur when there is a nonlinear relationship.
For this example, histograms of scores on the two variables (commitment and intimacy) were obtained by selecting the <Graph> → <Histogram> procedure; SPSS menu selections for this were outlined in Chapter 4 . An optional box in the Histogram dialog window can be checked to obtain a normal curve superimposed on the histogram; this can be helpful in assessment of the distribution shape.
The histograms for commitment and intimacy (shown in Figures 7.12 and 7.13 ) do not show perfect normal distribution shapes; both distributions were skewed. Possible scores on these variables ranged from 15 to 75; most people rated their relationships near the maximum value of 75 points. Thus, there was a ceiling effect such that scores were compressed at the upper end of the distribution. Only a few people rated their relationships low on commitment and intimacy, and these few low scores were clearly separate from the body of the distributions. As described in Chapter 4 , researchers need to take note of outliers and decide whether they should be removed from the data or recoded. However, these are always judgment calls. Some researchers prefer to screen out and remove unusually high or low scores, as these can have a disproportionate influence on the size of the correlation (particularly in small samples). Some researchers (e.g., Tabachnick & Fidell, 2007) routinely recommend the use of transformations (such as logs) to make nonnormal distribution shapes more nearly normal. (It can be informative to “experiment” with the data and see whether the obtained correlation changes very much when outliers are dropped or transformations are applied.) For the analysis presented here, no transformations were applied to make the distribution shapes more nearly normal; the r value was calculated with the outliers included and also with the outliers excluded.
The bivariate scatter plot for self-reported intimacy and commitment (in Figure 7.14 ) shows a positive, linear, and moderate to strong association between scores; that is, persons who reported higher scores on intimacy also reported higher scores on commitment. Although the pattern of data points in Figure 7.14 does not conform perfectly to the ideal bivariate normal distribution shape, this scatter plot does not show any serious problems. X and Y are approximately linearly related; their bivariate distribution is not extremely different from bivariate normal; there are no extreme bivariate outliers; and while the variance of Y is somewhat larger at low values of X than at high values of X, the differences in variance are not large.
Figure 7.10 A Bivariate Outlier That Inflates the Size of r
NOTE: With the bivariate outlier included, Pearson’s r(48) = +.64, p < .001; with the bivariate outlier removed, Pearson’s r(47) = −.10, not significant.
Figure 7.11 A Bivariate Outlier That Deflates the Size of r
NOTE: With the bivariate outlier included, Pearson’s r(48) = +.532, p < .001; with the bivariate outlier removed, Pearson’s r(47) = +.86, p < .001.
Figure 7.12 Data Screening: Histogram of Scores for Commitment
NOTE: Descriptive statistics: Mean = 66.63, SD = 8.16, N = 118.
Figure 7.13 Data Screening: Histogram of Scores for Intimacy
NOTE: Descriptive statistics: Mean = 68.04, SD = 7.12, N = 118.
Figure 7.14 Scatter Plot for Prediction of Commitment From Intimacy
7.5 Design Issues in Planning Correlation Research
Several of the problems at the end of this chapter use data with very small numbers of cases so that students can easily calculate Pearson’s r by hand or enter the data into SPSS. However, in general, studies that report Pearson’s r should be based on fairly large samples. Pearson’s r is not robust to the effect of extreme outliers, and the impact of outliers is greater when the N of the sample is small. Values of Pearson’s r show relatively large amounts of sampling error across different batches of data, and correlations obtained from small samples often do not replicate well. In addition, fairly large sample sizes are required so that there is adequate statistical power for the detection of differences between different correlations. Because of sampling error, it is not realistic to expect sample correlations to be a good indication of the strength of the relationship between variables in samples smaller than N = 30. When N is less than 30, the size of the correlation can be greatly influenced by just one or two extreme scores. In addition, researchers often want to choose sample sizes large enough to provide adequate statistical power (see Section 7.1 ). It is advisable to have an N of at least 100 for any study where correlations are reported.
It is extremely important to have a reasonably wide range of scores on both the predictor and the outcome variables. In particular, the scores should cover the range of behaviors to which the researcher wishes to generalize. For example, in a study that predicts verbal Scholastic Aptitude Test (VSAT) scores from GPA, a researcher might want to include a wide range of scores on both variables, with VSAT scores ranging from 250 to 800 and GPAs that range from very poor to excellent marks.
A report of a single correlation is not usually regarded as sufficient to be the basis of a thesis or a publishable paper (American Psychological Association, 2001, p. 5). Studies that use Pearson’s r generally include correlations among many variables and may include other analyses. Sometimes researchers report correlations among all possible pairs of variables; this often results in reporting hundreds of correlations in a single paper. This leads to an inflated risk of Type I error. A more thoughtful and systematic approach involving the examination of selected correlations is usually preferable (as discussed in Chapter 1 ). In exploratory studies, statistically significant correlations that are detected by examining dozens or hundreds of tests need to be replicated through cross-validation or new data collection before they can be treated as “findings.”
7.6 Computation of Pearson’s r
The version of the formula for the computation of Pearson’s r that is easiest to understand conceptually is as follows:
r = ∑(zX × zY)/N,
where zX = (X − MX)/sX, zY = (Y − MY)/sY, and N = number of cases (number of X, Y pairs of observations).
Alternative versions of this formula are easier to use and give less rounding error when Pearson’s r is calculated by hand. The version of the formula above is more helpful in understanding how the Pearson’s r value can provide information about the spatial distribution of X, Y data points in a scatter plot. This conceptual formula can be used for by-hand computation; it corresponds to the following operations. First, each X and Y score is converted to a standard score or z score; then, for each participant, zX is multiplied by zY; these products are summed across all participants; and finally, this sum is divided by the number of participants. The resulting value of r falls within the range −1.00 ≤ r ≤ +1.00.
Because we convert X and Y to standardized or z scores, the value of r does not depend on the units that were used to measure these variables. If we take a group of subjects and express their heights in both inches (X1) and centimeters (X2) and their weights in pounds (Y1) and kilograms (Y2), the correlation between X1, Y1 and between X2, Y2 will be identical. In both cases, once we convert height to a z score, we are expressing the individual’s height in terms of a unit-free distance from the mean. A person’s z score for height will be the same whether we work with height in inches, feet, or centimeters.
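A short Python sketch (made-up height and weight scores) illustrates both points: computing r as the averaged product of z scores, and its indifference to the units of measurement. The N − 1 divisor follows the SPSS convention noted in the worked example below.

```python
import numpy as np

def pearson_from_z(x, y):
    """Pearson's r as the averaged product of z scores (N - 1 divisor, as SPSS uses)."""
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return np.sum(zx * zy) / (len(x) - 1)

rng = np.random.default_rng(4)
height_in = rng.normal(67, 3, 50)                       # inches
weight_lb = 3.5 * height_in + rng.normal(0, 15, 50)     # pounds

height_cm = height_in * 2.54                            # same heights in centimeters
weight_kg = weight_lb / 2.2046                          # same weights in kilograms

print(round(pearson_from_z(height_in, weight_lb), 4))
print(round(pearson_from_z(height_cm, weight_kg), 4))     # identical: r is unit-free
print(round(np.corrcoef(height_in, weight_lb)[0, 1], 4))  # matches numpy's built-in
```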
Another formula for Pearson’s r is based on the covariance between X and Y:
Cov(X, Y) = ∑[(X − MX) × (Y − MY)]/(N − 1),
where MX is the mean of the X scores, MY is the mean of the Y scores, and N is the number of X, Y pairs of scores.
Note that the variance of X is equivalent to the covariance of X with itself:
Var(X) = Cov(X, X) = ∑[(X − MX) × (X − MX)]/(N − 1).
Pearson’s r can be calculated from the covariance of X with Y as follows:
r = Cov(X, Y)/(sX × sY).
A covariance, like a variance, is an arbitrarily large number; its size depends on the units used to measure the X and Y variables. For example, suppose a researcher wants to assess the relation between height (X) and weight (Y). These can be measured in many different units: Height can be given in inches, feet, meters, or centimeters, and weight can be given in terms of pounds, ounces, kilograms, or grams. If height is stated in inches and weight in ounces, the numerical scores given to most people will be large and the covariance will turn out to be very large. However, if heights are given in feet and weights in pounds, both the scores and the covariances between scores will be smaller values. Covariance, thus, depends on the units of measurement the researcher happened to use. This can make interpretation of covariance difficult, particularly in situations where the units of measurement are arbitrary.
Pearson correlation can be understood as a standardized covariance: The values of r fall within a fixed range from –1 to +1, and the size of r does not depend on the units of measurement the researcher happened to use for the variables. Whether height was measured in inches, feet, or meters, when the height scores are converted to standard or z scores, information about the units of measurement is lost. Because correlation is standardized, it is easier to interpret, and it is possible to set up some verbal guidelines to describe the sizes of correlations.
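Continuing the same kind of made-up height/weight example, the sketch below shows that the covariance changes with the units of measurement while the standardized covariance (Pearson's r) does not; it assumes numpy's default N − 1 divisor for cov.

```python
import numpy as np

rng = np.random.default_rng(5)
height_in = rng.normal(67, 3, 50)
weight_lb = 3.5 * height_in + rng.normal(0, 15, 50)
height_cm, weight_kg = height_in * 2.54, weight_lb / 2.2046

def cov_and_r(x, y):
    cov = np.cov(x, y)[0, 1]                       # covariance, in the variables' units
    r = cov / (x.std(ddof=1) * y.std(ddof=1))      # standardized covariance = Pearson's r
    return cov, r

for label, (x, y) in {"inches/pounds": (height_in, weight_lb),
                      "cm/kg": (height_cm, weight_kg)}.items():
    cov, r = cov_and_r(x, y)
    print(f"{label:14s} covariance = {cov:8.2f}   r = {r:.4f}")
```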
Table 7.2 Computation of Pearson’s r for a Set of Scores on Heart Rate (HR) and Self-Reported Tension
NOTE: ∑(zX × zY) = 7.41, Pearson’s r = ∑(zX × zY)/(N − 1) = +7.41/9 = .823
Here is a numerical example that shows the computation of Pearson’s r for a small dataset that contains N = 10 pairs of scores on heart rate (HR) and self-reported tension (see Table 7.2 ).
The first two columns of this table contain the original scores for the variables HR and tension. The next two columns contain the z scores for each variable, zHR and ztension (these z scores or standard scores can be saved as output from the SPSS Descriptive Statistics procedure). For this example, HR is the Y variable and tension is the X variable. The final column contains the product of zX and zY for each case, with ∑(zX × zY) at the bottom. Finally, Pearson’s r was obtained by taking ∑(zX × zY)/(N − 1) = + 7.41/9 = .823. (The values of r reported by SPSS use N − 1 in the divisor rather than N as in most textbook formulas for Pearson’s r. When N is large, for example, N greater than 100, the results do not differ much whether N or N − 1 is used as the divisor.) This value of r agrees with the value obtained by running the SPSS bivariate correlation/Pearson’s r procedure on the data that appear in Table 7.2 .
7.7 Statistical Significance Tests for Pearson’s r
7.7.1 Testing the Hypothesis That ρXY = 0
The most common statistical significance test is for the statistical significance of an individual correlation. The population value of the correlation between X and Y is denoted by the Greek letter rho (ρ). Given an obtained sample r between X and Y, we can test the null hypothesis that ρXY in the population equals 0. The formal null hypothesis that corresponds to the lack of a (linear) relationship between X and Y is H0: ρXY = 0.
When the population correlation ρXY is 0, the sampling distribution of rs is shaped like a normal distribution (for large N) or a t distribution with N − 2 df (for small N), except that the tails are not infinite (the tails end at +1 and −1); see the top panel in Figure 7.15 . That is, when the true population correlation ρ is actually 0, most sample rs tend to be close to 0; the sample rs tend to be normally distributed, but the tails of this distribution are not infinite (as they are for a true normal distribution), because sample correlations cannot be outside the range of −1 to +1. Because the sampling distribution for this situation is roughly that of a normal or t distribution, a t ratio to test this null hypothesis can be set up as follows: t = (r − ρ0)/SEr, evaluated with N − 2 df.
Figure 7.15 Sampling Distributions for r With N = 12
The value of SEr is given by the following equation: SEr = √[(1 − r2)/(N − 2)].
Substituting this value of SEr from Equation 7.7 into Equation 7.6 and rearranging the terms yields the most widely used formula for a t test for the significance of a sample r value; this t test has N − 2 degrees of freedom (df), and the hypothesized value of ρ0 is 0: t = r√(N − 2)/√(1 − r2).
It is also possible to set up an F ratio, with (1, N − 2) df, to test the significance of sample r. This F is equivalent to t2; it has the following form: F = [r2(N − 2)]/(1 − r2).
Programs such as SPSS provide an exact p value for each sample correlation (a two-tailed p value by default; a one-tailed p value can be requested). Critical values of the t and F distributions are provided in Appendixes B and C . It is also possible to look up whether r is statistically significant as a function of degrees of freedom and the r value itself directly in the table in Appendix E (without having to calculate t or F).
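A minimal sketch of these significance tests (standard t and F formulas for H0: ρ = 0; the r and N values here are made up):

```python
from scipy import stats

def r_significance(r, n):
    """t and F tests for H0: rho = 0, with df = n - 2 (two-tailed p)."""
    df = n - 2
    t = r * (df ** 0.5) / ((1 - r ** 2) ** 0.5)
    p = 2 * stats.t.sf(abs(t), df)
    F = t ** 2                                  # F with (1, n - 2) df gives the same p
    return t, F, p

t, F, p = r_significance(r=0.35, n=50)
print(f"t(48) = {t:.2f}, F(1, 48) = {F:.2f}, p = {p:.4f}")
```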
7.7.2 Testing Other Hypotheses About ρXY
It is uncommon to test null hypotheses about other specific hypothesized values of ρXY (such as H0: ρXY = .90). For this type of null hypothesis, the sampling distribution of r is not symmetrical and therefore cannot be approximated by a t distribution. For example, if ρXY = .90, then most sample rs will be close to .90; sample rs will be limited to the range from –1 to +1, so the sampling distribution will be extremely skewed (see the bottom panel in Figure 7.15 ). To correct for this nonnormal distribution shape, a data transformation is applied to r before testing hypotheses about nonzero hypothesized values of ρ. The Fisher r to Z transformation rescales sample rs in a way that yields a more nearly normal distribution shape, which can be used for hypothesis testing. (Note that in this book, lowercase z always refers to a standard score; uppercase Z refers to the Fisher Z transformation based on r. Some books label the Fisher Z using a lowercase z or z′.) The r to Fisher Z transformation is shown in Table 7.3 (for reference, it is also included in Appendix G at the end of this book).
The value of Fisher Z that corresponds to a sample Pearson’s r is usually obtained by table lookup, although Fisher Z can be obtained from this formula: Z = .5 × ln[(1 + r)/(1 − r)].
Table 7.3 Transformation of Pearson’s r to Fisher Z
SOURCE: Adapted from Lane (2001).
A Fisher Z value can also be converted back into an r value by using Table 7.3 .
For the Fisher Z, the standard error (SE) does not depend on ρ but only on N; the sampling distribution of Fisher Z scores has this standard error: SEZ = 1/√(N − 3).
Thus, to test the null hypothesis H0: ρ = .90 with N = 28 and an observed sample r of .80, the researcher needs to do the following:
1. Convert ρhyp to the corresponding Fisher Z value, Zhyp, by looking up the Z value in Table 7.3 . For ρhyp = .90, Zhyp = 1.472.
2. Convert the observed sample r (rsample) to a Fisher Z value (Zsample) by looking up the corresponding Fisher Z value in Table 7.3 . For an observed r of .80, Zsample = 1.099.
3. Calculate SEZ from Equation 7.11: SEZ = 1/√(N − 3) = 1/√(28 − 3) = 1/√25 = .20.
4. Compute the z ratio as follows:
z = (Zsample – Zhyp)/SEZ = (1.099 – 1.472)/.20 = –1.865.
5. For α = .05, two-tailed, the rejection region for a z test is z > +1.96 or z < −1.96; the obtained z of −1.865 does not fall in this region, so do not reject the null hypothesis that ρ = .90.
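As a computational check of the five steps above (a sketch; np.arctanh computes the Fisher Z exactly, so the result differs slightly from the three-decimal table lookups):

```python
import numpy as np
from scipy.stats import norm

r_sample, rho_hyp, n = 0.80, 0.90, 28

z_sample = np.arctanh(r_sample)        # Fisher Z for the sample r (~1.099)
z_hyp = np.arctanh(rho_hyp)            # Fisher Z for the hypothesized rho (~1.472)
se_z = 1 / np.sqrt(n - 3)              # = 1/5 = .20

z = (z_sample - z_hyp) / se_z          # ~ -1.87
p = 2 * norm.sf(abs(z))                # two-tailed p, about .06
print(f"z = {z:.3f}, p = {p:.3f} -> "
      f"{'reject' if abs(z) > 1.96 else 'do not reject'} H0: rho = .90")
```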
Following sections discuss the ways in which the Fisher Z transformation is also used when testing the null hypothesis that the value of ρ is equal between two different populations, H0: ρ1 = ρ2. For example, we can test whether the correlation between X and Y is significantly different for women versus men. Fisher Z is also used to set up confidence intervals (CIs) for correlation estimates.
7.7.3 Assessing Differences Between Correlations
It can be quite problematic to compare correlations that are based on different samples or populations, or that involve different variables, because so many factors can artifactually influence the size of Pearson’s r (many of these factors are discussed in Section 7.9 ). For example, suppose a researcher wants to evaluate whether the correlation between emotional intelligence (EI) and drug use is stronger for males than for females. If the scores on drug use have a much more restricted range in the female sample than in the male sample, this restricted range of scores in the female sample might make the correlation between these variables smaller for females. If the measurement of drug use has lower reliability for females than for males, this difference in reliability could also artifactually reduce the magnitude of the correlation between EI and drug use in the female sample. If two correlations differ significantly, this difference might arise due to artifact (such as a narrower range of scores used to compute one r) rather than because of a difference in the true strength of the relationship. Researchers have to be very cautious when comparing correlations, and they should acknowledge possible artifacts that might have led to different r values (sources of artifacts are discussed in Section 7.9 ). For further discussion of problems with comparisons of correlations and other standardized coefficients to make inferences about differences in effect sizes across populations, see Greenland, Maclure, Schlesselman, Poole, and Morgenstern (1991) and Greenland, Schlesselman, and Criqui (1986).
It is useful to have statistical significance tests for comparison of correlations; these at least help to answer whether the difference between a pair of correlations is so small that it could very likely be due to sampling error. Obtaining statistical significance is a necessary, but not a sufficient, condition for concluding that a genuine difference in the strength of relationship is present. Two types of comparisons between correlations are described here.
In the first case, the test compares the strength of the correlation between the same two variables in two different groups or populations. Suppose that the same set of variables (such as X = EI and Y = drug abuse or DA) is correlated in two different groups of participants (Group 1 = males, Group 2 = females). We might ask whether the correlation between EI and DA is significantly different for men versus women. The corresponding null hypothesis is H0: ρ1 = ρ2.
To test this hypothesis, the Fisher Z transformation has to be applied to both sample r values. Let r1 be the sample correlation between EI and DA for males and r2 the sample correlation between EI and DA for females; N1 and N2 are the numbers of participants in the male and female groups, respectively.
First, using Table 7.3 , look up the Z1 value that corresponds to r1 and the Z2 value that corresponds to r2.
Next, apply the following formula: z = (Z1 − Z2)/√[1/(N1 − 3) + 1/(N2 − 3)].
The test statistic z is evaluated using the standard normal distribution; if the obtained z ratio is greater than +1.96 or less than –1.96, then the correlations r1 and r2 are judged significantly different using α = .05, two-tailed. This test should be used only when the N in each sample is fairly large, preferably N > 100.
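A sketch of this independent-groups comparison (the r and N values below are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def compare_independent_rs(r1, n1, r2, n2):
    """z test for H0: rho1 = rho2 using the Fisher Z transformation."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    return z, 2 * norm.sf(abs(z))

z, p = compare_independent_rs(r1=0.45, n1=150, r2=0.20, n2=160)   # hypothetical values
print(f"z = {z:.2f}, p = {p:.3f}")   # |z| > 1.96 -> significantly different at alpha = .05
```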
A second situation of interest involves comparison of two different predictor variables. Suppose the researcher wants to know whether the correlation of X with Z is significantly different from the correlation of Y with Z. The corresponding null hypothesis is H0: ρXZ = ρYZ.
This test does not involve the use of Fisher Z transformations. Instead, we need to have all three possible bivariate correlations (rXZ, rYZ, and rXY); N = total number of participants. The test statistic (from Lindeman, Merenda, & Gold, 1980) is a t ratio of this form:
The resulting t value is evaluated using critical values from the t distribution with (N − 3) df. Even if a pair of correlations is judged to be statistically significantly different using these tests, the researcher should be very cautious about interpreting this result. Different size correlations could arise because of differences across populations or across predictors in factors that affect the size of r discussed in Section 7.9 , such as range of scores, reliability of measurement, outliers, and so forth.
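The excerpt above does not reproduce the Lindeman, Merenda, and Gold equation itself. As an illustration only, the sketch below implements Hotelling's (1940) t for comparing two dependent correlations that share a variable, which is one common form of this test with N − 3 df; it may differ in detail from the expression in Warner's text, and the input values are hypothetical.

```python
import numpy as np
from scipy.stats import t as t_dist

def compare_dependent_rs(r_xz, r_yz, r_xy, n):
    """Hotelling-style t test for H0: rho_XZ = rho_YZ (same sample, shared variable Z)."""
    det = 1 - r_xy**2 - r_xz**2 - r_yz**2 + 2 * r_xy * r_xz * r_yz
    t = (r_xz - r_yz) * np.sqrt((n - 3) * (1 + r_xy) / (2 * det))
    df = n - 3
    return t, 2 * t_dist.sf(abs(t), df)

t, p = compare_dependent_rs(r_xz=0.50, r_yz=0.30, r_xy=0.40, n=120)   # hypothetical values
print(f"t(117) = {t:.2f}, p = {p:.3f}")
```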
7.7.4 Reporting Many Correlations: Need to Control Inflated Risk of Type I Error
Journal articles rarely report just a single Pearson’s r; in fact, the Publication Manual of the American Psychological Association (American Psychological Association, 2001) states that this is not sufficient for a reportable study. Unfortunately, however, many studies report such a large number of correlations that evaluation of statistical significance becomes problematic. Suppose that k = 20 variables are measured in a nonexperimental study. If the researcher does all possible bivariate correlations, there will be k × (k − 1)/2 different correlations, in this case (20 × 19)/2 = 190 correlations. When we set our risk of Type I error at α = .05 for the statistical significance test for each individual correlation, this implies that out of 100 statistical significance tests that are done (on data from populations that really have no relationship between the X and Y variables), about 5 tests will be instances of Type I error (rejection of H0 when H0 is true). When a journal article reports 200 correlations, for example, one would expect that about 5% of these (10 correlations) should be statistically significant using the α = .05 significance level, even if the data were completely random. Thus, of the 200 correlations, it is very likely that at least some of the significance tests (on the order of 9 or 10) will be instances of Type I error. If the researcher runs 200 correlations and finds that the majority of them (say, 150 out of 200) are significant, then it seems likely that at least some of these correlations are not merely artifacts of chance. However, if a researcher reports 200 correlations and only 10 are significant, then it’s quite possible that the researcher has found nothing beyond the expected number of Type I errors. It is even more problematic for the reader when it’s not clear how many correlations were run; if a researcher runs 200 correlations, hand selects 10 statistically significant rs after the fact, and then reports only the 10 rs that were judged to be significant, it is extremely misleading to the reader, who is no longer able to evaluate the true magnitude of the risk of Type I error.
In general, it is common in exploratory nonexperimental research to run large numbers of significance tests; this inevitably leads to an inflated risk of Type I error. That is, the probability that the entire research report contains at least one instance of Type I error is much higher than the nominal risk of α = .05 that is used for any single significance test. There are several possible ways to deal with this problem of inflated risk of Type I error.
7.7.4.1 Limiting the Number of Correlations
One approach is to limit the number of correlations that will be examined at the outset, before looking at the data, based on theoretical assumptions about which predictive relations are of interest. The possible drawback of this approach is that it may preclude serendipitous discoveries. Sometimes, unexpected observed correlations do point to relationships among variables that were not anticipated from theory but that can be confirmed in subsequent replications.
7.7.4.2 Cross-Validation of Correlations
A second approach is cross-validation. In a cross-validation study, the researcher randomly divides the data into two batches; thus, if the entire study had data for N = 500 participants, each batch would contain 250 cases. The researcher then does extensive exploratory analysis on the first batch of data and decides on a limited number of correlations or predictive equations that seem to be interesting and useful. Then, the researcher reruns this small set of correlations on the second half of the data. If the relations between variables remain significant in this fresh batch of data, it is less likely that these relationships were just instances of Type I error. The main problem with this approach is that researchers often don’t have large enough numbers of cases to make this possible.
7.7.4.3 Bonferroni Procedure: A More Conservative Alpha Level for Tests of Individual Correlations
A third approach is the Bonferroni procedure. Suppose that the researcher plans to do k = 10 correlations and wants to have an experiment-wise alpha (EWα) of .05. To keep the risk of obtaining at least one Type I error as low as 5% for a set of k = 10 significance tests, it is necessary to set the per-comparison alpha (PCα) level lower for each individual test. Using the Bonferroni procedure, the PCα used to test the significance of each individual r value is set at EWα/k; for example, if EWα = .05 and k = 10, each individual correlation has to have an observed p value less than .05/10, or .005, to be judged statistically significant. The main drawback of this approach is that it is quite conservative. Sometimes the number of correlations that are tested in exploratory studies is quite large (100 or 200 correlations have been reported in some recent papers). If the error rate were adjusted by dividing .05 by 100, the resulting PCα would be so low that it would almost never be possible to judge individual correlations significant. Sometimes the experiment-wise α for the Bonferroni test is set higher than .05; for example, EWα = .10 or .20.
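A short sketch makes the inflation of Type I error and the Bonferroni adjustment concrete (the counts are illustrative; the family-wise calculation assumes independent tests):

```python
# Family-wise risk of at least one Type I error when k independent tests are run at alpha,
# and the Bonferroni per-comparison alpha that caps that risk near the desired level.
alpha = 0.05
for k in (10, 100, 190):
    familywise = 1 - (1 - alpha) ** k          # assumes independent tests
    per_comparison = alpha / k                 # Bonferroni-adjusted alpha for each r
    print(f"k = {k:3d}: risk of >= 1 Type I error = {familywise:.2f}, "
          f"Bonferroni per-test alpha = {per_comparison:.4f}")
```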
If the researcher does not try to limit the risk of Type I error in any of these three ways (by limiting the number of significance tests, doing a cross-validation, or using a Bonferroni-type correction for alpha levels), then at the very least, the researcher should explain in the write-up that the p values that are reported probably underestimate the true overall risk of Type I error. In these situations, the Discussion section of the paper should reiterate that the study is exploratory, that relationships detected by running large numbers of significance tests are likely to include large numbers of Type I errors, and that replications of the correlations with new samples are needed before researchers can be confident that the relationships are not simply due to chance or sampling error.
7.8 Setting Up CIs for Correlations
If the researcher wants to set up a CI using a sample correlation, he or she must use the Fisher Z transformation (in cases where r is not equal to 0). The upper and lower bounds of the 95% CI can be calculated by applying the usual formula for a CI to the Fisher Z values that correspond to the sample r.
The general formula for a CI is as follows: sample estimate ± tcrit × SEestimate; here, the estimate and its SE are expressed on the Fisher Z scale.
To set up a CI around a sample r value (let r = .50, with N = 43, for example), first look up the Fisher Z value that corresponds to r = .50; from Table 7.3 , this is Fisher Z = .549. For N = 43 and df = N − 3 = 40, SEZ = 1/√(N − 3) ≈ .157.
For N = 43 and a 95% CI, tcrit is approximately equal to +2.02 for the top 2.5% of the distribution. The Fisher Z values and the critical values of t are substituted into the equations for the lower and upper bounds:
Lower bound of 95% CI = Fisher Z − 2.02 × SEZ = .549 − 2.02 × .157 = .232, Upper bound of 95% CI = Fisher Z + 2.02 × SEZ = .549 + 2.02 × .157 = .866.
Table 7.3 is used to transform these boundaries given in terms of Fisher Z back into estimated correlation values:
Fisher Z = .232 is equivalent to r = .23, Fisher Z = .866 is equivalent to r = .70.
Thus, if a researcher obtains a sample r of .50 with N = 43, the 95% CI is from .23 to .70. If this CI does not include 0, then the sample r would be judged statistically significant using α = .05, two-tailed. The SPSS Bivariate Correlation procedure does not provide CIs for r values, but these can easily be calculated by hand.
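The same calculation can be scripted; the sketch below is a rough Python version of the worked example (it uses the normal critical value 1.96 rather than the tabled t of 2.02, so the bounds differ slightly from the hand calculation).

```python
# Sketch of the CI calculation above for r = .50, N = 43; math.atanh is the
# Fisher Z transform and math.tanh converts the bounds back to r units.
import math

def r_confidence_interval(r, n, crit=1.96):
    z = math.atanh(r)               # Fisher Z for the sample r
    se = 1 / math.sqrt(n - 3)       # standard error of Fisher Z
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

print(r_confidence_interval(.50, 43))   # approximately (.24, .70)
```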
7.9 Factors That Influence the Magnitude and Sign of Pearson’s r
7.9.1 Pattern of Data Points in the X, Y Scatter Plot
To understand how the formula for correlation can provide information about the location of points in a scatter plot and how it detects a tendency for high scores on Y to co-occur with high or low scores on X, it is helpful to look at the arrangement of points in an X, Y scatter plot (see Figure 7.16 ). Consider what happens when the scatter plot is divided into four quadrants or regions: scores that are above and below the mean on X and scores that are above and below the mean on Y.
The data points in Regions II and III are cases that are “concordant”; these are cases for which high X scores were associated with high Y scores or low X scores were paired with low Y scores. In Region II, both zX and zY are positive, and their product is also positive; in Region III, both zX and zY are negative, so their product is also positive. If most of the data points fall in Regions II and/or III, it follows that most of the contributions to the ∑ (zX × zY) sum of products will be positive, and the correlation will tend to be large and positive.
Figure 7.16 X, Y Scatter Plot Divided Into Quadrants (Above and Below the Means on X and Y)
The data points in Regions I and IV are “discordant” because these are cases where high X went with low Y and/or low X went with high Y. In Region I, zX is negative and zY is positive; in Region IV, zX is positive and zY is negative. This means that the product of zX and zY for each point that falls in Region I or IV is negative. If there are a large number of data points in Region I and/or IV, then most of the contributions to ∑(zX × zY) will be negative, and r will tend to be negative.
If the data points are about evenly distributed among the four regions, then positive and negative values of zX × zY will be about equally common, they will tend to cancel each other out when summed, and the overall correlation will be close to zero. This can happen because X and Y are unrelated (as in Figure 7.5 ) or in situations where there is a strongly curvilinear relationship (as in Figure 7.7 ). In either of these situations, high X scores are associated with high Y scores about as often as high X scores are associated with low Y scores.
Note that any time a statistical formula includes a product between variables of the form ∑(X × Y) or ∑(zX × zY), the computation provides information about correlation or covariance. These products summarize information about the spatial arrangement of X, Y data points in the scatter plot; these summed products tend to be large and positive when most of the data points are in the upper right and lower left (concordant) areas of the scatter plot. In general, formulas that include sums such as ∑X or ∑Y provide information about the means of variables (just divide by N to get the mean). Terms that involve ∑X2 or ∑Y2 provide information about variability. Awareness about the information that these terms provide makes it possible to decode the kinds of information that are included in more complex computational formulas. Any time a ∑(X × Y) term appears, one of the elements of information included in the computation is covariance or correlation between X and Y.
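A short numerical check (simulated data, not from the text) shows that summing the zX × zY products and dividing by N − 1 reproduces Pearson's r:

```python
# Sketch: r computed as sum(zX * zY)/(N - 1) agrees with numpy's built-in value.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.6 * x + rng.normal(size=100)         # build in a positive linear relation

zx = (x - x.mean()) / x.std(ddof=1)        # sample z scores (N - 1 divisor)
zy = (y - y.mean()) / y.std(ddof=1)
r_from_products = np.sum(zx * zy) / (len(x) - 1)

print(r_from_products, np.corrcoef(x, y)[0, 1])   # the two values match
```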
Correlations provide imperfect information about the “true” strength of predictive relationships between variables. Many characteristics of the data, such as restricted ranges of scores, nonnormal distribution shape, outliers, and low reliability, can lead to over- or underestimation of the correlations between variables. Correlations and covariances provide the basic information for many other multivariate analyses (such as multiple regression and multivariate analysis of variance). It follows that artifacts that influence the values of sample correlations and covariances will also affect the results of other multivariate analyses. It is therefore extremely important for researchers to understand how characteristics of the data, such as restricted range, outliers, or measurement unreliability, influence the size of Pearson’s r, for these aspects of the data also influence the sizes of regression coefficients, factor loadings , and other coefficients used in multivariate models.
7.9.2 Biased Sample Selection: Restricted Range or Extreme Groups
The ranges of scores on the X and Y variables can influence the size of the sample correlation. If the research goal is to estimate the true strength of the correlation between X and Y variables for some population of interest, then the ideal sample should be randomly selected from the population of interest and should have distributions of scores on both X and Y that are representative of, or similar to, the population of interest. That is, the mean, variance, and distribution shape of scores in the sample should be similar to the population mean, variance, and distribution shape.
Suppose that the researcher wants to assess the correlation between GPA and VSAT scores. If data are obtained for a random sample of many students from a large high school with a wide range of student abilities, scores on GPA and VSAT are likely to have wide ranges (GPA from about 0 to 4.0, VSAT from about 250 to 800). See Figure 7.17 for hypothetical data that show a wide range of scores on both variables. In this example, when a wide range of scores are included, the sample correlation between VSAT and GPA is fairly high (r = +.61).
However, samples are sometimes not representative of the population of interest; because of accidentally biased or intentionally selective recruitment of participants, the distribution of scores in a sample may differ from the distribution of scores in the population of interest. Some sampling methods result in a restricted range of scores (on X or Y or both variables). Suppose that the researcher obtains a convenience sample by using scores for a class of honors students. Within this subgroup, the range of scores on GPA may be quite restricted (3.3 – 4.0), and the range of scores on VSAT may also be rather restricted (640 – 800). Within this subgroup, the correlation between GPA and VSAT scores will tend to be smaller than the correlation in the entire high school, as an artifact of restricted range. Figure 7.18 shows the subset of scores from Figure 7.17 that includes only cases with GPAs greater than 3.3 and VSAT scores greater than 640. For this group, which has a restricted range of scores on both variables, the correlation between GPA and VSAT scores drops to +.34. It is more difficult to predict a 40- to 60-point difference in VSAT scores from a .2- or .3-point difference in GPA for the relatively homogeneous group of honors students, whose data are shown in Figure 7.18 , than to predict the 300- to 400-point differences in VSAT scores from 2- or 3-point differences in GPA in the more diverse sample shown in Figure 7.17 .
In general, when planning studies, researchers should try to include a reasonably wide range of scores on both predictor and outcome variables. They should also try to include the entire range of scores about which they want to be able to make inferences because it is risky to extrapolate correlations beyond the range of scores for which you have data. For example, if you show that there is only a small correlation between age and blood pressure for a sample of participants with ages up to 40 years, you cannot safely assume that the association between age and blood pressure remains weak for ages of 50, 60, and 80 (for which you have no data). Even if you find a strong linear relation between two variables in your sample, you cannot assume that this relation can be extrapolated beyond the range of X and Y scores for which you have data (or, for that matter, to different types of research participants).
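A quick simulation (hypothetical data, not the GPA/VSAT example) illustrates the restriction-of-range effect described above:

```python
# Sketch: selecting only high scorers on X attenuates the sample correlation.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
y = 0.6 * x + 0.8 * rng.normal(size=5000)      # population correlation near .6

r_full = np.corrcoef(x, y)[0, 1]
keep = x > 1.0                                 # a "selective" subsample on X
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]
print(r_full, r_restricted)                    # the restricted r is noticeably smaller
```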
Figure 7.17 Correlation Between Grade Point Average (GPA) and Verbal Scholastic Aptitude Test (VSAT) Scores in Data With Unrestricted Range (r = +.61)
Figure 7.18 Correlation Between Grade Point Average (GPA) and Verbal Scholastic Aptitude Test (VSAT) Scores in a Subset of Data With Restricted Range (Pearson’s r = +.34)
NOTE: This is the subset of the data in Figure 7.17 for which GPA > 3.3 and VSAT > 640.
A different type of bias in correlation estimates occurs when a researcher purposefully selects groups that are extreme on both X and Y variables. This is sometimes done in early stages of research in an attempt to ensure that a relationship can be detected. Figure 7.19 illustrates the data for GPA and VSAT for two extreme groups selected from the larger batch of data in Figure 7.17 (honors students vs. failing students). The correlation between GPA and VSAT scores for this sample that comprises two extreme groups was r = +.93. Pearson’s r obtained for samples that are formed by looking only at extreme groups tends to be much higher than the correlation for the entire range of scores. When extreme groups are used, the researcher should note that the correlation for this type of data typically overestimates the correlation that would be found in a sample that included the entire range of possible scores. Examination of extreme groups can be legitimate in early stages of research, as long as researchers understand that the correlations obtained from such samples do not describe the strength of relationship for the entire range of scores.
Figure 7.19 Correlation Between Grade Point Average (GPA) and Verbal Scholastic Aptitude Test (VSAT) Scores Based on Extreme Groups (Pearson’s r = +.93)
NOTE: Two subsets of the data in Figure 7.17 (low group, GPA < 1.8 and VSAT < 400; high group, GPA > 3.3 and VSAT > 640).
7.9.3 Correlations for Samples That Combine Groups
It is important to realize that a correlation between two variables (for instance, X = EI and Y = drug use) may be different for different types of people. For example, Brackett et al. (2004) found that EI was significantly predictive of illicit drug use behavior for males but not for females (men with higher EI engaged in less drug use). The scatter plot (of hypothetical data) in Figure 7.20 illustrates a similar but stronger interaction effect —“different slopes for different folks.” In Figure 7.20 , there is a fairly strong negative correlation between EI and drug use for males; scores for males appear as triangular markers in Figure 7.20 . In other words, there was a tendency for males with higher EI to use drugs less. For women (data points shown as circular markers in Figure 7.20 ), drug use and EI were not significantly correlated. The gender differences shown in this graph are somewhat exaggerated (compared with the actual gender differences Brackett et al. [2004] found in their data), to make it clear that there were differences in the correlation between EI and drugs for these two groups (women vs. men).
A spurious correlation can also arise as an artifact of between-group differences. The hypothetical data shown in Figure 7.21 show a positive correlation between height and violent behavior for a sample that includes both male and female participants (r = +.687). Note that the overall positive correlation between height and violence occurred because women were low on both height and violence compared with men; the apparent correlation between height and violence is an artifact that arises because of gender differences on both variables. Within the male and female groups, there was no significant correlation between height and violence (r = −.045 for males, r = −.066 for females, both not significant). A spurious correlation between height and violence arose when these two groups were lumped together into one analysis that did not take gender into account.
In either of these two research situations, it can be quite misleading to look at a correlation for a batch of data that mixes two or several different kinds of participants’ data together. It may be necessary to compute correlations separately within each group (separately for males and for females, in this example) to assess whether the variables are really related and, if so, whether the nature of the relationship differs within various subgroups in your data.
7.9.4 Control of Extraneous Variables
Chapter 10 describes ways of statistically controlling for other variables that may influence the correlation between an X and Y pair of variables. For example, one simple way to “control for” gender is to calculate X, Y correlations separately for the male and female groups of participants. When one or more additional variables are statistically controlled, the size of the X, Y correlation can change in any of the following ways: It may become larger or smaller, change sign, drop to zero, or remain the same. It is rarely sufficient in research to look at a single bivariate correlation in isolation; it is often necessary to take other variables into account to see how these affect the nature of the X, Y relationship. Thus, another factor that influences the sizes of correlations between X and Y is the set of other variables that are controlled, either through statistical control (in the data analysis) or through experimental control (by holding some variables constant, for example).
Figure 7.20 Scatter Plot for Interaction Between Gender and Emotional Intelligence (EI) as Predictors of Drug Use: “Different Slopes for Different Folks”
NOTE: Correlation between EI and drug use for entire sample is r(248) = −.60, p < .001; correlation within female subgroup (circular markers) is r(112) = −.11, not significant; correlation within male subgroup (triangular markers) is r(134) = −.73, p < .001.
7.9.5 Disproportionate Influence by Bivariate Outliers
Like the sample mean (M), Pearson’s r is not robust against the influence of outliers. A single bivariate outlier can lead to either gross overestimation or gross underestimation of the value of Pearson’s r (refer to Figures 7.10 and 7.11 for visual examples). Sometimes bivariate outliers arise due to errors in data entry, but they can also be valid scores that are unusual combinations of values (it would be unusual to find a person with a height of 72 in. and a weight of 100 lb, for example). Particularly in relatively small samples, a single unusual data value can have a disproportionate impact on the estimate of the correlation. For example, in Figure 7.10, if the outlier in the upper right-hand corner of the scatter plot is included, r = +.64; if that outlier is deleted, r drops to −.10. It is not desirable to have the outcome of a study hinge on the scores of just one or a few unusual participants. An outlier can either inflate the size of the sample correlation (as in Figure 7.10) or deflate it (as in Figure 7.11). For Figure 7.11, the r is +.532 if the outlier in the lower right-hand corner is included in the computation of r; the r value increases to +.86 if this bivariate outlier is omitted.
Figure 7.21 Scatter Plot of a Spurious Correlation Between Height and Violence (Due to Gender Differences)
NOTE: For entire sample, r(248) = +.687, p < .001; male subgroup only, r(134) = −.045, not significant; female subgroup only, r(112) = −.066, not significant.
The impact of a single extreme outlier is much more problematic when the total number of cases is small (N < 30, for example). Researchers need to make thoughtful judgments about whether to retain, omit, or modify these unusual scores (see Chapter 4 for more discussion on the identification and treatment of outliers). If outliers are modified or deleted, the decision rules for doing this should be explained, and it should be made clear how many cases were involved. Sometimes it is useful to report the correlation results with and without the outlier, so that readers can judge the impact of the bivariate outlier for themselves.
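The sensitivity of r to a single bivariate outlier is easy to demonstrate with made-up data (the numbers below are illustrative, not those in Figures 7.10 and 7.11):

```python
# Sketch: one extreme point added to a small sample can produce a large positive r.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=15)
y = rng.normal(size=15)        # generated independently of x
r_without = np.corrcoef(x, y)[0, 1]

x_out = np.append(x, 6.0)      # one case that is extreme on both X and Y
y_out = np.append(y, 6.0)
r_with = np.corrcoef(x_out, y_out)[0, 1]

print(r_without, r_with)       # r is pulled strongly toward +1 by the single outlier
```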
As samples become very small, the impact of individual outliers becomes greater. Also, as the degrees of freedom become very small, we get “overfitting.” In a sample that contains only 4 or 5 data points, a straight line is likely to fit rather well, and Pearson’s r is likely to be large in absolute value, even if the underlying relation between variables in the population is not strongly linear.
7.9.6 Shapes of Distributions of X and Y
Pearson’s r can be used as a standardized regression coefficient to predict the standard score on Y from the standard score on X (or vice versa).
This equation (Equation 7.19) is as follows: z′Y = r × zX.
Correlation is a symmetric index of the relationship between variables; that is, the correlation between X and Y is the same as the correlation between Y and X. (Some relationship indices are nonsymmetrical; for instance, the raw score slope coefficient to predict scores on Y from scores on X is not generally the same as the raw score coefficient to predict scores on X from scores on Y.) One interpretation of Pearson’s r is that it predicts the position of the score on the Y variable (in standard score or z score units) from the position of the score on the X variable in z score units. For instance, one study found a correlation of about .32 between the height of wives (X) and the height of husbands (Y). The equation to predict husband height (in z score units) from wife height (in z score units) is as follows:
z′Y = .32 × zX; this equation predicts standard scores on Y from standard scores on X. Because r is a symmetrical index of relationship, it is also possible to use it to predict zX from zY: z′X = r × zY.
In words, Equation 7.19 tells us that one interpretation of correlation is in terms of relative distance from the mean. That is, if X is 1 SD from its mean, we predict that Y will be r SD units from its mean. If husband and wife heights are correlated +.32, then a woman who is 1 SD above the mean in the distribution of female heights is predicted to have a husband who is about 1/3 SD above the mean in the distribution of male heights (and vice versa).
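A two-line sketch of Equation 7.19 (using the wife/husband height correlation of +.32 from the example above):

```python
# Sketch: the predicted z score on Y is r times the z score on X (Equation 7.19).
def predict_zy(r: float, zx: float) -> float:
    return r * zx

print(predict_zy(0.32, 1.0))    # 0.32: about 1/3 SD above the mean
print(predict_zy(0.32, -2.0))   # -0.64: regression toward the mean works in both directions
```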
Figure 7.22 Mapping of Standardized Scores From zX to z′Y for Three Values of Pearson’s r
A correlation of +1.0 implies an exact one-to-one mapping of locations of Y scores from locations of X scores. An exact one-to-one correspondence of score locations relative to the mean can occur only when the distribution shapes of the X and Y variables are identical. For instance, if both X and Y are normally distributed, it is possible to map Y scores one-to-one on corresponding X scores. If r happens to equal +1, Equation 7.19 for the prediction of z′Y from zX implies that there must be an exact one-to-one mapping of score locations. That is, when r = +1, each person’s score on Y is predicted to be the same distance from the mean of Y as his or her score on X was from the mean of X. If the value of r is less than 1, we find that the predicted score on Y is always somewhat closer to the mean than the score on X. See Figure 7.22 for a representation of the mapping from zX to z′Y for three different values of r: r = +1, r = +.5, and r = .00.
This phenomenon, that the predicted score on the dependent variable is closer to the mean than the score on the independent variable whenever the correlation is less than 1, is known as regression toward the mean . (We call the equation to predict Y from scores on one or more X variables a “regression equation” for this reason.) When X and Y are completely uncorrelated (r = 0), the predicted score for all participants corresponds to zY = 0 regardless of their scores on X. That is, when X and Y are not correlated, our best guess at any individual’s score on Y is just the mean of Y (MY). If X and Y are positively correlated (but r is less than 1), then for participants who score above the mean on X, we predict scores that are above the mean (but not as far above the mean as the X scores) on Y.
Note that we can obtain a perfect one-to-one mapping of z score locations on the X and Y scores only when X and Y both have the same distribution shape. That is, we can obtain Pearson’s r of +1 only if X and Y both have the same shape. As an example, consider a situation where X has a normal distribution shape and Y has a skewed distribution shape (see Figure 7.23). It is not possible to make a one-to-one mapping of the scores with zX values greater than zX = +1 to corresponding zY values greater than +1; the Y distribution has far more scores with zY values greater than +1 than the X distribution in this example.
Figure 7.23 Failure of One-to-One Mapping of Score Location for Different Distribution Shapes
NOTE: For example, there are more scores located about 1 SD below the mean in the upper/normal distribution than in the lower/skewed distribution, and there are more scores located about 1 SD above the mean in the lower/skewed distribution than in the upper/normal distribution. When the shapes of distributions for the X and Y scores are very different, this difference in distribution shape makes it impossible to have an exact one-to-one mapping of score locations for X and Y; this in turn makes it impossible to obtain a sample correlation close to +1.00 between scores on X and Y.
This example illustrates that when we correlate scores on two quantitative variables that have different distribution shapes, the maximum possible correlation that can be obtained will artifactually be less than 1 in absolute value. Perfect one-to-one mapping (that corresponds to r = 1) can arise only when the distribution shapes for the X and Y variables are the same.
It is desirable for all the quantitative variables in a multivariable study to have nearly normal distribution shapes. However, if you want to compare X1 and X2 as predictors of Y and if the distribution shapes of X1 and X2 are different, then the variable with a distribution less similar to the distribution shape of Y will artifactually tend to have a lower correlation with Y. This is one reason why comparisons among correlations for different predictor variables can be misleading. The effect of distribution shape on the size of Pearson’s r is one of the reasons why it is important to assess distribution shape for all the variables before interpreting and comparing correlations.
Thus, correlations can be artifactually small because the variables that are being correlated have drastically different distribution shapes. Sometimes nonlinear transformations (such as log) can be helpful in making a skewed or exponential distribution appear more nearly normal. Tabachnick and Fidell (2007) recommend using such transformations and/or removing extreme outlier scores to achieve more normal distribution shapes when scores are nonnormally distributed. Some analysts (e.g., Harris, 2001; Tabachnick & Fidell, 2007) suggest that it is more important for researchers to screen their data to make sure that the distribution shapes of quantitative variables are approximately normal in shape than for researchers to worry about whether the scores have true interval/ratio level of measurement properties.
7.9.7 Curvilinear Relations
Pearson’s r can detect only a linear relation between an X and a Y variable. Various types of nonlinear relationships result in very small r values even though the variables may be strongly related; for example, D. P. Hayes and Meltzer (1972) showed that favorable evaluation of speakers varied as a curvilinear function of the proportion of time the person spent talking in a group discussion, with moderate speakers evaluated most positively (refer to Figure 7.7 for an example of this type of curvilinear relation). An r of 0 does not necessarily mean that two variables are unrelated, only that they are not linearly related. When nonlinear or curvilinear relations appear in a scatter plot (for an example of one type of curvilinear relation, see Figure 7.7), Pearson’s r is not an appropriate statistic to describe the strength of the relationship between variables. Other approaches can be used when the relation between X and Y is nonlinear. One approach is to apply a data transformation to one or both variables (e.g., replace X by log(X)). Sometimes a relationship that is not linear in the original units of measurement becomes linear under the right choice of data transformation. Another approach is to recode the scores on the X predictor variable so that instead of having continuous scores, you have a group membership variable (X = 1, low; X = 2, medium; X = 3, high); an analysis of variance (ANOVA) to compare means on Y across these groups can detect a nonlinear relationship such that Y has the highest scores for medium values of X. Another approach is to predict scores on Y from scores on X2 as well as X.
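A small simulation (illustrative only) shows how a strong but curvilinear relation can yield an r near zero:

```python
# Sketch: an inverted-U relation between X and Y produces Pearson's r close to 0,
# even though Y is almost completely determined by X.
import numpy as np

rng = np.random.default_rng(8)
x = np.linspace(-3, 3, 200)
y = -(x ** 2) + rng.normal(scale=0.5, size=200)
print(np.corrcoef(x, y)[0, 1])   # near zero despite the strong curvilinear relation
```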
7.9.8 Transformations of Data
Linear transformations of X and Y that involve addition, subtraction, multiplication, or division by constants do not change the magnitude of r (except in extreme cases where very large or very small numbers can result in important information getting lost in rounding error). Nonlinear transformations of X and Y (such as log, square root, X2, etc.) can change correlations between variables substantially, particularly if there are extreme outliers or if there are several log units separating the minimum and maximum data values. Transformations may be used to make nonlinear relations linear, to minimize the impact of extreme outliers, or to make nonnormal distributions more nearly normal.
For example, refer to the data that appear in Figure 7.10. There was one extreme outlier (circled). One way to deal with this outlier is to select it out of the dataset prior to calculating Pearson’s r. Another way to minimize the impact of this bivariate outlier is to take the base 10 logarithms of both X and Y and correlate the log-transformed scores. Taking the logarithm reduces the value of the highest scores, so that they lie closer to the body of the distribution.
An example of data that show a strong linear relationship only after a log transformation is applied to both variables was presented in Chapter 4, Figures 4.47 and 4.48. Raw scores on body mass and metabolic rate showed a curvilinear relation in Figure 4.47; when the base 10 log transformation was applied to both variables, the log-transformed scores were linearly related.
Finally, consider a positively skewed distribution (refer to Figure 4.20). If you take either the base 10 or natural log of the scores in Figure 4.20 and do a histogram of the logs, the extreme scores on the upper end of the distribution appear to be closer to the body of the distribution, and the distribution shape can appear to be closer to normal when a log transformation is applied. However, a log transformation changes the shape of a sample distribution of scores substantially only when the maximum score on X is 10 or 100 times as high as the minimum score on X; if scores on X are ratings on a 1- to 5-point scale, for example, a log transformation applied to these scores will not substantially change the shape of the distribution.
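A brief check (simulated scores, not those in the figures cited) of how a base 10 log transform changes distribution shape; scipy's skew function is used here only as a convenient summary of asymmetry:

```python
# Sketch: log10 makes a strongly positively skewed variable much more symmetric.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
scores = rng.lognormal(mean=2.0, sigma=1.0, size=1000)   # max is many times the min

print(skew(scores))             # large positive skew in the raw scores
print(skew(np.log10(scores)))   # close to zero after the log transform
```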
7.9.9 Attenuation of Correlation Due to Unreliability of Measurement
Other things being equal, when the X and Y variables have low measurement reliability, this low reliability tends to decrease or attenuate their observed correlations with other variables. A reliability coefficient for an X variable is often denoted by rXX. One way to estimate a reliability coefficient for a quantitative X variable would be to measure the same group of participants on two occasions and correlate the scores at Time 1 and Time 2 (a test-retest correlation). (Values of reliability coefficients rXX range from 0 to 1.) The magnitude of this attenuation of observed correlation as an artifact of unreliability is given by the following formula:
rXY = ρXY × √(rXX × rYY) (Equation 7.20), where rXY is the observed correlation between X and Y, ρXY is the “real” correlation between X and Y that would be obtained if both variables were measured without error, and rXX and rYY are the test-retest reliabilities of the variables.
Because reliabilities are usually less than perfect (rXX < 1), Equation 7.20 implies that the observed rXY will generally be smaller than the “true” population correlation ρXY. The lower the reliabilities, the greater the predicted attenuation or reduction in magnitude of the observed sample correlation.
It is theoretically possible to correct for attenuation and estimate the true correlation ρXY, given obtained values of rXY, rXX, and rYY, but note that if the reliability estimates themselves are inaccurate, this estimated true correlation may be quite misleading. Equation 7.21 can be used to generate attenuation-corrected estimates of correlations; however, keep in mind that this correction will be inaccurate if the reliabilities of the measurements are not precisely known: estimated ρXY = rXY/√(rXX × rYY) (Equation 7.21).
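Equations 7.20 and 7.21 translate directly into code; the reliabilities below are made-up values chosen only to show the round trip:

```python
# Sketch: attenuation of r due to unreliability (Eq. 7.20) and the correction (Eq. 7.21).
import math

def attenuated_r(rho_xy, r_xx, r_yy):
    """Observed correlation expected when the error-free correlation is rho_xy."""
    return rho_xy * math.sqrt(r_xx * r_yy)

def disattenuated_r(r_xy, r_xx, r_yy):
    """Estimated error-free correlation, corrected for unreliability."""
    return r_xy / math.sqrt(r_xx * r_yy)

observed = attenuated_r(0.60, 0.80, 0.70)
print(observed)                                # about .45: smaller than the true .60
print(disattenuated_r(observed, 0.80, 0.70))   # recovers .60, if the reliabilities are accurate
```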
7.9.10 Part-Whole Correlations
If you create a new variable that is a function of one or more existing variables (as in X = Y + Z, or X = Y − Z), then the new variable X will be correlated with its component parts Y and Z. Thus, it is an artifact that the total Wechsler Adult Intelligence Scale (WAIS) score (which is the sum of WAIS verbal + WAIS quantitative) is correlated to the WAIS verbal subscale. Part-whole correlations can also occur as a consequence of item overlap: If two psychological tests have identical or very similar items contained in them, they will correlate artifactually because of item overlap. For example, many depression measures include questions about fatigue, sleep disturbance, and appetite disturbance. Many physical illness symptom checklists also include questions about fatigue and sleep and appetite disturbance. If a depression score is used to predict a physical symptom checklist score, a large part of the correlation between these could be due to duplication of items.
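A one-line simulation (hypothetical scores) makes the part-whole artifact concrete:

```python
# Sketch: a total score correlates with its own component even when the two
# components are unrelated to each other.
import numpy as np

rng = np.random.default_rng(7)
verbal = rng.normal(size=200)
quant = rng.normal(size=200)       # independent of verbal by construction
total = verbal + quant

print(np.corrcoef(total, verbal)[0, 1])   # about .71 (1/sqrt(2)) purely by construction
```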
7.9.11 Aggregated Data
Correlations can turn out to be quite different when they are computed on individual participant data versus aggregated data where the units of analysis correspond to groups of participants. It can be misleading to make inferences about individuals based on aggregated data; sociologists call this the “ecological fallacy.” Sometimes relationships appear much stronger when data are presented in aggregated form (e.g., when each data point represents a mean, median, or rate of occurrence for a geographical region). Keys (1980) collected data on serum cholesterol and on coronary heart disease outcomes for N = 12,763 men who came from 19 different geographical regions around the world. He found a correlation near zero between serum cholesterol and coronary heart disease for individual men; however, when he aggregated data for each of the 19 regions and looked at median serum cholesterol for each region as a predictor of the rate of coronary heart disease, the r for these aggregated data (in Figure 7.24) went up to +.80.
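A rough simulation (not the Keys data) of how aggregation can inflate a correlation: within-region scores are mostly noise, but the region means line up strongly.

```python
# Sketch: individual-level r is small, but the r computed on group means is large.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
groups = np.repeat(np.arange(19), 100)                  # 19 "regions", 100 cases each
region_effect = rng.normal(size=19)[groups]             # shared region-level differences
x = region_effect + 3 * rng.normal(size=groups.size)    # individual scores: mostly noise
y = region_effect + 3 * rng.normal(size=groups.size)

df = pd.DataFrame({"region": groups, "x": x, "y": y})
print(df[["x", "y"]].corr().iloc[0, 1])                    # small individual-level r
print(df.groupby("region").mean().corr().loc["x", "y"])    # much larger aggregated r
```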
7.10 Pearson’s r and r2 as Effect-Size Indexes
The indexes that are used to describe the effect size or strength of linear relationship in studies that report Pearson’s r values are usually either just r itself or r2, which estimates the proportion of variance in Y that can be predicted from X (or, equivalently, the proportion of variance in X that is predictable from Y). The proportion of explained variance (r2) can be diagramed by using overlapping circles, as shown in Figure 7.25. Each circle represents the unit variance of z scores on each variable, and the area of overlap is proportional to r2, the shared or explained variance. The remaining area of each circle corresponds to 1 – r2, for example, the proportion of variance in Y that is not explained by X. In meta-analysis research, where results are combined across many studies, it is more common to use r itself as the effect-size indicator. Cohen (1988) suggested the following verbal labels for sizes of r: r of about .10 or less (r2 < .01) is small, r of about .30 (r2 = .09) is medium, and r greater than .50 (r2 > .25) is large. (Refer to Table 5.2 for a summary of suggested verbal labels for r and r2.)
As shown in Figure 7.25, r2 and (1 – r2) provide a partition of variance for the scores in the sample. For example, the variance in Y scores can be partitioned into a proportion that is predictable from X (r2) and a proportion that is not predictable from X (1 – r2). It is important to understand that these proportions of variance will vary across situations. Consideration of two artificial situations illustrates this. Suppose that a researcher wants to know what proportion of variance in children’s height (Y) is predictable from the parent’s height (X). Height can be influenced by other factors; malnutrition can reduce height substantially. In Study A, let’s suppose that all of the children receive good nourishment. In this situation, parent height (X) will probably predict a very high proportion of variance in child height (Y). In Study B, by contrast, let’s assume that the children receive widely varying levels of nourishment. In this situation, there will be substantial variance in the children’s height that is associated with nourishment (this variance is included in 1 – r2) and, thus, a lower proportion of variance that is predictable from adult height (X) compared with the results in Study A. The point of this example is that the partition of variance in a sample depends very much on the composition of the sample and whether variables other than the X predictor of interest vary within the sample and are related to scores on Y. In other words, we should be very cautious about generalizing partition of variance obtained for one sample to describe partition of variance in other groups or in broader populations. The partition of variance that we obtain in one sample is highly dependent on the composition of that sample. For research questions such as what proportion of variance in intelligence is predictable from genes, the results in an individual sample depend heavily on the amount of variation in environmental factors that may also influence intelligence.
Figure 7.24 A Study in Which the Correlation Between Serum Cholesterol and Cardiovascular Disease Outcomes Was r = +.80 for Aggregated Scores
NOTE: Each data point summarizes information for one of the 19 geographic regions in the study. In contrast, the correlation between these variables for individual-level scores was close to zero.
SOURCE: Reprinted by permission of the publisher from Seven Countries: A Multivariate Analysis of Death and Coronary Heart Disease by Ancel Keys, p. 122, Cambridge, Mass., Harvard University Press, copyright © 1980 by the President and Fellows of Harvard College.
Figure 7.25 Proportion of Shared Variance Between X and Y Corresponds to r2, the Proportion of Overlap Between the Circles
7.11 Statistical Power and Sample Size for Correlation Studies
Statistical power is the likelihood of obtaining a sample r large enough to reject H0: ρ = 0 when the population correlation ρ is really nonzero. As in earlier discussions of statistical power (for the independent samples t test in Chapter 5 and the one-way between-subjects ANOVA in Chapter 6), statistical power for a correlation depends on the following factors: the true effect size in the population (e.g., the real value of ρ or ρ2 in the population of interest), the alpha level that is used as a criterion for statistical significance, and N, the number of subjects for whom we have X, Y scores. Using Table 7.4, it is possible to look up the minimum N of participants required to obtain adequate statistical power for different population correlation values. For example, let α = .05, two-tailed; set the desired level of statistical power at .80 or 80%; and assume that the true population value of the correlation is ρ = .5. This implies a population ρ2 of .25. From Table 7.4, a minimum of N = 28 subjects would be required to have power of 80% to obtain a significant sample result if the true population correlation is ρ = .50. Note that for smaller effects (e.g., a ρ2 value on the order of .05), sample sizes need to be substantially larger; in this case, N = 153 would be needed to have power of .80. It is generally a good idea to have studies with at least N = 100 cases where correlations are reported, partly to have adequate statistical power but also to avoid situations where there is not enough information to evaluate whether assumptions (such as bivariate normality and linearity) are satisfied and to avoid situations where one or two extreme outliers can have a large effect on the size of the sample r.
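The tabled sample sizes can be approximated with the Fisher Z (normal) method; the sketch below is an approximation rather than a reproduction of Table 7.4, so its answers differ from the table by a case or two.

```python
# Sketch: approximate N for power ~.80, alpha = .05 two-tailed, H0: rho = 0,
# using z values 1.96 (alpha) and 0.84 (power) and the Fisher Z transform.
import math

def approx_n_for_correlation(rho, z_alpha=1.96, z_power=0.84):
    return round(((z_alpha + z_power) / math.atanh(rho)) ** 2 + 3)

print(approx_n_for_correlation(0.50))              # about 29 (Table 7.4 gives 28)
print(approx_n_for_correlation(math.sqrt(0.05)))   # about 155 (Table 7.4 gives 153)
```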
Table 7.4 Statistical Power for Pearson’s r
SOURCE: Adapted from Jaccard and Becker (2002).
7.12 Interpretation of Outcomes for Pearson’s r
7.12.1 “Correlation Does Not Necessarily Imply Causation” (So What Does It Imply?)
Introductory statistics students are generally told, “Correlation does not imply causation.” This can be more precisely stated as “Correlational (or nonexperimental) design does not imply causation.” When data are obtained from correlational or nonexperimental research designs, we cannot make causal inferences. In the somewhat uncommon situations where Pearson’s r is applied to data from well-controlled experiments, tentative causal inferences may be appropriate. As stated earlier, it is the nature of the research design, not the statistic that happens to be used to analyze the data, that determines whether causal inferences might be appropriate.
Many researchers who conduct nonexperimental studies and apply Pearson’s r to variables do so with causal theories implicitly in mind. When a researcher measures social stress and blood pressure in natural settings and correlates these measures, he or she expects to see a positive correlation, in part because the researcher suspects that social stress may “cause” an increase in blood pressure. Very often, predictor variables are selected for use in correlational research because the researcher believes that they may be “causal.” If the researcher finds no statistical association between X and Y (Pearson’s r that is close to 0 and no evidence of any other type of statistical association, such as a curvilinear relation), this finding of no statistical relationship is inconsistent with the belief that X causes Y. If X did cause Y, then increases in X should be statistically associated with changes in Y. If we find a significant correlation, it is consistent with the idea that X might cause Y, but it is not sufficient evidence to prove that X causes Y. Scores on X and Y can be correlated for many reasons; a causal link between X and Y is only one of many situations that tends to create a correlation between X and Y. A statistical association between X and Y (such as a significant Pearson’s r) is a necessary, but not sufficient, condition for the conclusion that X might cause Y.
Many other situations (apart from “X causes Y”) may give rise to correlations between X and Y. Here is a list of some of the possible reasons for an X, Y correlation. We can conclude that X causes Y only if we can rule out all these other possible explanations. (And in practice, it is almost never possible to rule out all these other possible explanations in nonexperimental research situations; well-controlled experiments come closer to accomplishing the goal of ruling out rival explanations, although a single experiment is not a sufficient basis for a strong causal conclusion.)
Reasons why X and Y may be correlated include the following:
1. X may be a cause of Y.
2. Y may be a cause of X.
3. X might cause Z, and in turn, Z might cause Y. In causal sequences like this, we say that the Z variable “mediates” the effect of X on Y or that Z is a “mediator” or “mediating” variable.
4. X is confounded with some other variable Z, and Z predicts or causes Y. In a well- controlled experiment, we try to artificially arrange the situation so that no other variables are confounded systematically with our X intervention variable. In nonexperimental research, we often find that our X variable is correlated with or confounded with many other variables that might be the “real” cause of Y. In most nonexperimental studies, there are potentially dozens or hundreds of potential nuisance (or confounded) Z variables. For instance, we might try to predict student GPA from family structure (single parent vs. two parent), but if students with single-parent families have lower GPAs than students from two-parent families, this might be because single-parent families have lower incomes, and lower-income neighborhoods may have poor-quality schools that lead to poor academic performance.
5. X and Y might actually be measures of the same thing, instead of two separate variables that could be viewed as cause and effect. For example, X could be a depression measure that consists of questions about both physical and mental symptoms; Y could be a checklist of physical health symptoms. If X predicts Y, it might be because the depression measure and the health measure both included the same questions (about fatigue, sleep disturbance, low energy level, appetite disturbance, etc.).
6. Both X and Y might be causally influenced by some third variable Z. For instance, both ice cream sales (X) and homicides (Y) tend to increase when temperatures (Z) go up. When X and Y are both caused or predicted by some third variable and X has no direct causal influence on Y, the X, Y correlation is called spurious.
7. Sometimes a large X, Y correlation arises just due to chance and sampling error; it just happens that the participants in the sample tended to show a strong correlation because of the luck of the draw, even though the variables are not correlated in the entire population. (When we use statistical significance tests, we are trying to rule out this possible explanation, but we can never be certain that a significant correlation was not simply due to chance.)
This list does not exhaust the possibilities; apart from the single Z variable mentioned in these examples, there could be numerous other variables involved in the X, Y relationship.
In well-controlled experiments, a researcher tries to arrange the situation so that no other Z variable is systematically confounded with the independent variable X. Experimental control makes it possible, in theory, to rule out many of these possible rival explanations. However, in nonexperimental research situations, when a large correlation is found, all these possible alternative explanations have to be considered and ruled out before we can make a case for the interpretation that “X causes Y”; in practice, it is not possible to rule out all of these alternative reasons why X and Y are correlated in nonexperimental data. It is possible to make the case for a causal interpretation of a correlation somewhat stronger by statistically controlling for some of the Z variables that you know are likely to be confounded with X, but it is never possible to identify and control for all the possible confounds. That’s why we have to keep in mind that “correlational design does not imply causation.”
7.12.2 Interpretation of Significant Pearson’s r Values
Pearson’s r describes the strength and direction of the linear predictive relationship between variables. The sign of r indicates the direction of the relationship; for a positive r, as scores on X increase, scores on Y also tend to increase; for a negative r, as scores on X increase, scores on Y tend to decrease. The r value indicates the magnitude of change in z′Y for a 1-SD change in zX. For example, if r = +.5, for a 1-SD increase in zX, we predict a .5-SD increase in zY. When researchers limit themselves to descriptive statements about predictive relationships or statistical associations between variables, it is sufficient to describe this in terms of the direction and magnitude of the change in zY associated with change in zX.
Thus, a significant Pearson correlation can be interpreted as information about the degree to which scores on X and Y are linearly related, or the degree to which Y is predictable from X. Researchers often examine correlations among variables as a way of evaluating whether variables might possibly be causally related. In many studies, researchers present a significant correlation between X and Y as evidence that X and Y might possibly be causally related. However, researchers should not interpret correlations that are based on nonexperimental research designs as evidence for causal connections. Experiments in which X is manipulated and other variables are controlled and the Y outcome is measured provide stronger evidence for causal inference.
7.12.3 Interpretation of a Nonsignificant Pearson’s r Value
Pearson’s r near zero does not always indicate a complete lack of relationship between variables. A correlation that is not significantly different from zero might be due to a true lack of any relationship between X and Y (as in Figure 7.5), but correlations near zero also arise when there is a strong but curvilinear relation between X and Y (as in Figure 7.7) or when one or a few bivariate outliers are not consistent with the pattern of relationship suggested by the majority of the data points (as in Figure 7.11). For this reason, it is important to examine the scatter plot before concluding that X and Y are not related.
If examination of the scatter plot suggests a nonlinear relation, a different analysis may be needed to describe the X, Y association. For example, using the multiple regression methods described in a later chapter, Y may be predicted from X, X2, X3, and other powers of X. Other nonlinear transformations (such as log of X) may also convert scores into a form where a linear relationship emerges. Alternatively, if the X variable is recoded to yield three or more categories (e.g., if income in dollars is recoded into low-, medium-, and high-income groups), a one-way ANOVA comparing scores among these three groups may reveal differences.
If a sample correlation is relatively large but not statistically significant due to small sample size, it is possible that the correlation might be significant in a study with a larger sample. Only additional research with larger samples can tell us whether this is the case. When a nonsignificant correlation is obtained, the researcher should not conclude that the study proves the absence of a relation between X and Y. It would be more accurate to say that the study did not provide evidence of a linear relationship between X and Y.
7.13 SPSS Output and Model Results Write-Up
To obtain a Pearson correlation using SPSS, the menu selections (from the main menu above the data worksheet) are <Analyze> → <Correlate> → <Bivariate>, as shown in Figure 7.26. These menu selections open up the SPSS Correlations dialog window, shown in Figure 7.27.
The data analyst uses the cursor to highlight the names of the variables in the left-hand window (which lists all the variables in the active data file) that are to be intercorrelated. Then, the user clicks on the arrow button to move two or more variable names into the list of variables to be analyzed. In this example, the variables to be correlated are named commit (commitment) and intimacy. Other boxes can be checked to determine whether significance tests are to be displayed and whether two-tailed or one-tailed p values are desired. To run the analyses, click the OK button. The output from this procedure is displayed in Figure 7.28, which shows the value of the Pearson correlation (r = +.745), the p value (which would be reported as p < .001, two-tailed; see Note 3), and the N of data pairs the correlation was based on (N = 118). The degrees of freedom for this correlation are given by N – 2, so in this example, the correlation has 116 df.
It is possible to run correlations among many pairs of variables. The SPSS Correlations dialog window that appears in Figure 7.29 includes a list of five variables: intimacy, commit, passion, length (of relationship), and times (the number of times the person has been in love). If the data analyst enters a list of five variables as shown in this example, SPSS runs the bivariate correlations among all possible pairs of these five variables (this results in (5 × 4)/2 = 10 different correlations), and these correlations are reported in a summary table as shown in Figure 7.30. Note that because correlation is “symmetrical” (i.e., the correlation between X and Y is the same as the correlation between Y and X), the correlations that appear in the upper right-hand corner of the table in Figure 7.30 are the same as those that appear in the lower left-hand corner. When such tables are presented in journal articles, usually only the correlations in the upper right-hand corner are shown.
Sometimes a data analyst wants to obtain summary information about the correlations between a set of predictor variables X1 and X2 (length and times) and a set of outcome variables Y1, Y2, and Y3 (intimacy, commitment, and passion). To do this, we need to paste and edit SPSS Syntax. Look again at the SPSS dialog window in Figure 7.29; there is a button labeled Paste just below OK. Clicking the Paste button opens up a new window, called a Syntax window, and pastes the SPSS commands (or syntax) for correlation that were generated by the user’s menu selections into this window; the initial SPSS Syntax window appears in Figure 7.31. This syntax can be saved, printed, or edited. In this example, we will edit the syntax; in Figure 7.32, the SPSS keyword WITH has been inserted into the list of variable names so that the list of variables in the CORRELATIONS command now reads, “intimacy commit passion WITH length times.” (It does not matter whether the SPSS commands are in uppercase or lowercase; the word WITH appears in uppercase characters in this example to make it easy to see the change in the command.) For a correlation command that includes the key word WITH, SPSS computes correlations for all pairs of variables on the lists that come before and after WITH. In this example, each variable in the first list (intimacy, commit, and passion) is correlated with each variable in the second list (length and times). This results in a table of six correlations, as shown in Figure 7.33.
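Outside SPSS, the same “first list WITH second list” table can be sketched in Python with pandas; the variable names mirror the SPSS example, but the data frame here is hypothetical.

```python
# Sketch: a 3 x 2 cross-correlation table analogous to the WITH syntax output.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(118, 5)),
                  columns=["intimacy", "commit", "passion", "length", "times"])

rows = ["intimacy", "commit", "passion"]    # the list before WITH
cols = ["length", "times"]                  # the list after WITH
print(df.corr().loc[rows, cols])            # Pearson correlations for the six pairs
```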
Figure 7.26 Menu Selections for the Bivariate Correlation Procedure
Figure 7.27 SPSS Bivariate Correlations Procedure Dialog Window
Figure 7.28 SPSS Output for the Bivariate Correlation Procedure
Figure 7.29 Bivariate Correlations Among All Possible Pairs of Variables in a List of Selected Variables
If many variables are included in the list for the bivariate correlation procedure, the resulting table of correlations can be several pages long. It is often useful to set up smaller tables for each subset of correlations that is of interest, using the WITH command to designate which variables should be paired.
Note that the p values given in the SPSS printout are not adjusted in any way to correct for the inflated risk of Type I error that arises when large numbers of significance tests are reported. If the researcher wants to control or limit the risk of Type I error, it can be done by using the Bonferroni procedure. For example, to hold the EWα level to .05, the PCα level used to test the six correlations in Table 7.5 could be set to α = .05/6 = .008. Using this Bonferroni-corrected PCα level, none of the correlations in Figure 7.33 would be judged statistically significant.
Figure 7.30 SPSS Output: Bivariate Correlations Among All Possible Pairs of Variables
Figure 7.31 Pasting SPSS Syntax Into a Syntax Window and Editing Syntax
Figure 7.32 Edited SPSS Syntax
Figure 7.33 SPSS Output From Edited Syntax: Correlations Between Variables in the First List (Intimacy, Commitment, Passion) and Variables in the Second List (Length of Present Dating Relationship, Number of Times in Love)
Table 7.5 Correlations Between Sternberg’s (1997) Triangular Love Scale (Intimacy, Commitment, and Passion) and Length of Relationship and Number of Times in Love (N = 118 Participants)
NOTE: The p values for these correlations are reported for each single significance test; the p values have not been adjusted to correct for inflation of Type I error that may arise when multiple significance tests are reported.
*p < .05, two-tailed.
Results: One Correlation
A Pearson correlation was performed to assess whether levels of intimacy in dating relationships could be predicted from levels of commitment on a self-report survey administered to 118 college students currently involved in dating relationships. Commitment and intimacy scores were obtained by summing items on two of the scales from Sternberg’s (1997) Triangular Love Scale; the range of possible scores was from 15 (low levels of commitment or intimacy) to 75 (high levels of commitment or intimacy). Examination of histograms indicated that the distribution shapes were not close to normal for either variable. Both distributions were skewed. For both variables, most scores were near the high end of the scale, which indicated the existence of ceiling effects, and there were a few isolated outliers at the low end of the scale. However, the skewness was not judged severe enough to require data transformation or removal of outliers. The scatter plot of intimacy with commitment suggested a positive linear relationship. The correlation between intimacy and commitment was statistically significant, r(116) = +.75, p < .001 (two-tailed). The r2 was .56; thus, about 56% of the variance in intimacy could be predicted from levels of commitment. This relationship remained strong and statistically significant, r(108) = +.64, p < .001, two-tailed, even when outliers with scores less than 56 on intimacy and 49 on commitment were removed from the sample.
Results: Several Correlations
Pearson correlations were performed to assess whether levels of self-reported intimacy, commitment, and passion in dating relationships could be predicted from the length of the dating relationship and the number of times the participant has been in love, based on a self-report survey administered to 118 college students currently involved in dating relationships. Intimacy, commitment, and passion scores were obtained by summing items on scales from Sternberg’s (1997) Triangular Love Scale; the range of possible scores was from 15 to 75 on each of the three scales. Examination of histograms indicated that the distribution shapes were not close to normal for any of these variables; distributions of scores were negatively skewed for intimacy, commitment, and passion. Most scores were near the high end of the scale, which indicated the existence of ceiling effects, and there were a few isolated outliers at the low end of the scale. However, the skewness was not judged severe enough to require data transformation or removal of outliers. The scatter plots suggested that when there were relationships between pairs of variables, these relationships were (weakly) linear. The six Pearson correlations are reported in Table 7.5. Only the correlation between commitment and length of relationship was statistically significant, r(116) = +.20, p < .05 (two-tailed). The r2 was .04; thus, only about 4% of the variance in commitment scores could be predicted from length of the relationship; this is a weak positive relationship. There was a tendency for participants who reported longer relationship duration to report higher levels of commitment.
If Bonferroni-corrected PCα levels are used to control for the inflated risk of Type I error that occurs when multiple significance tests are performed, the PCα level is .05/6 = .008. Using this more conservative criterion for statistical significance, none of the six correlations in Table 7.5 would be judged statistically significant.
7.14 Summary
This chapter described the use of Pearson’s r to describe the strength and direction of the linear association between a pair of quantitative variables. The conceptual formula for r (given in terms of zX and zY) was discussed. Computing products between zX and zY provides information about the spatial distribution of points in an X, Y scatter plot and the tendency for high scores on X to be systematically associated with high (or low) scores on Y. Fisher Z was introduced because it is needed for the construction of confidence intervals around sample r values and in some of the significance tests for r. Pearson’s r itself can be interpreted as an effect-size index; sometimes, r2 is also reported to describe the strength of relationship in terms of the proportion of variance in Y that is predictable from X. Pearson’s r is symmetric: That is, the correlation between X and Y is identical to the correlation between Y and X. Many of the analyses introduced in later chapters (such as partial correlation and multiple regression) are based on Pearson’s r; factors that artifactually influence the magnitude of Pearson’s r (such as restricted range, unreliability, and bivariate outliers) can also influence the magnitude of regression coefficients. Pearson’s r is often applied to X and Y variables that are both quantitative; however, it is also possible to use special forms of Pearson’s r (such as the point biserial correlation) when one or both of the variables are dichotomous (categorical, with only two categories), as discussed in Chapter 8.
Notes
1. One version of the formula to calculate Pearson’s r from the raw scores on X and Y is as follows:
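(The equation image did not survive extraction; one common raw-score version of the formula is supplied here.)

r = [N∑XY − (∑X)(∑Y)] / √{[N∑X2 − (∑X)2][N∑Y2 − (∑Y)2]}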
This statistic has N – 2 df, where N is the number of (X, Y) pairs of observations.
The value of Pearson’s r reported by SPSS is calculated using N − 1 as the divisor:
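(Again, the equation display is missing from this copy. Because SPSS computes the z scores using the sample standard deviation, the sum of the zX × zY products is divided by N − 1 rather than N:)

r = ∑(zX × zY)/(N − 1)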
2. In more advanced statistical methods such as structural equation modeling, covariances rather than correlation are used as the basis for estimation of model parameters and evaluation of model fit.
3. Note that SPSS printouts sometimes report “p = .000” or “Sig = .000.” (Sig is an SPSS abbreviation for significance, and p means probability or risk of Type I error; you want the risk of error to be small, so you usually hope to see p values that are small, i.e., less than .05.) These terms, p and sig, represent the theoretical risk of Type I error. This risk can never be exactly zero, but it becomes smaller and smaller as r increases and/or as N increases. It would be technically incorrect to report an exact p value as p = .000 in your write-up. Instead, when SPSS gives p as .000, you should write, “p < .001,” to indicate that the risk of Type I error was estimated to be very small.
Comprehension Questions
1. A meta-analysis (Anderson & Bushman, 2001) reported that the average correlation between time spent playing violent video games (X) and engaging in aggressive behavior (Y) in a set of 21 well-controlled experimental studies was .19. This correlation was judged to be statistically significant. In your own words, what can you say about the nature of the relationship?
2. Harker and Keltner (2001) examined whether emotional well-being in later life could be predicted from the facial expressions of 141 women in their college yearbook photos. The predictor variable of greatest interest was the "positivity of emotional expression" in the college yearbook photo. They also had these photographs rated on physical attractiveness. They contacted the same women for follow-up psychological assessments at age 52 (and at other ages; data not shown here). Here are the correlations of these two predictors (based on ratings of the yearbook photo) with several of their self-reported social and emotional outcomes at age 52:
a. Which of the six correlations above are statistically significant (i) if you test each correlation using α = .05, two-tailed, and (ii) if you set EWα = .05 and use Bonferroni-corrected tests?
b. How would you interpret their results?
c. Can you make any causal inferences from this study? Give reasons.
d. Would it be appropriate for the researchers to generalize these findings to other groups, such as men?
e. What additional information would be available to you if you were able to see the scatter plots for these variables?
3. Are there ever any circumstances when a correlation such as Pearson's r can be interpreted as evidence for a causal connection between two variables? If yes, what circumstances?
4. For a correlation of −.64 between X and Y, each 1-SD change in zX corresponds to a predicted change of ____ SD in zY.
5. A researcher says that 50% of the variance in blood pressure can be predicted from HR and that blood pressure is positively associated with HR. What is the correlation between blood pressure and HR?
6. Suppose that you want to do statistical significance tests for four correlations and you want your EWα to be .05. What PCα would you use if you apply the Bonferroni procedure?
7. Suppose that you have two different predictor variables (X and Z) that you use to predict scores on Y. What formula would you need to use to assess whether their correlations with Y differ significantly? What information would you need to have to do this test? Which of these is the appropriate null hypothesis?
H0: ρ = 0   H0: ρ = .9   H0: ρ1 = ρ2   H0: ρXY = ρZY
What test statistic should be used for each of these null hypotheses?
8. Why should researchers be very cautious about comparison of correlations that involve different variables?
9. How are r and r2 interpreted?
10. Draw a diagram to show r2 as an overlap between circles.
11. If "correlation does not imply causation," what does it imply?
12. What are some of the possible reasons for large correlations between a pair of variables, X and Y?
13. What does it mean to say that r is a symmetrical measure of a relationship?
14. Suppose that two raters (Rater A and Rater B) each assign physical attractiveness scores (0 = not at all attractive to 10 = extremely attractive) to a set of seven facial photographs. Pearson's r is a common index of interrater reliability or agreement on quantitative ratings. A correlation of +1 would indicate perfect rank order agreement between raters, while an r of 0 would indicate no agreement about judgments of relative attractiveness. Generally, rs of .8 to .9 are considered desirable when reliability is assessed. The attractiveness ratings are as follows:
a. Compute the Pearson correlation between the Rater A/Rater B attractiveness ratings. What is the obtained r value?
b. Is your obtained r statistically significant? (Unless otherwise specified, use α = .05, two-tailed, for all significance tests.)
c. Are the Rater A and Rater B scores "reliable"? Is there good or poor agreement between raters?
15. From a review of Chapters 5 and 6, what other analyses could you do with the variables in the SPSS dataset love.sav (variables described in Table 7.1)? Give examples of pairs of variables for which you could do t tests or one-way ANOVA. Your teacher may ask you to run these analyses and write them up.
16. Explain how the formula r = ∑(zX × zY)/N is related to the pattern of points in a scatter plot (i.e., the numbers of concordant/discordant pairs).
17. What assumptions are required for a correlation to be a valid description of the relation between X and Y?
18. What is a bivariate normal distribution? Sketch the three-dimensional appearance of a bivariate normal distribution of scores.
19. When r = 0, does it necessarily mean that X and Y are completely unrelated?
20. Discuss how nonlinear relations may result in small rs.
21. |
Sketch the sampling distribution of r when ρ = 0 and the sampling distribution of ρ when r = .80. Which of these two distributions is nonnormal? What do we do to correct for this nonnormality when we set up significance tests? |
|
22. |
What is a Fisher Z, how is it obtained, and what is it used for? |
|
23. |
In words, what does the equation = r × zX say? |
|
24. Take one of the following datasets: the SPSS file love.sav or some other dataset provided by your teacher, data obtained from your own research, or data downloaded from the Web. From your chosen dataset, select a pair of variables that would be appropriate for Pearson's r. Examine histograms and a scatter plot to screen for possible violations of assumptions; report any problems and any steps you took to remedy problems, such as removal of outliers (see Chapter 4). Write up a brief Results section reporting the correlation, its statistical significance, and r2. Alternatively, your teacher may ask you to run a list or group of correlations and present them in table form. If you are reporting many correlations, you may want to use Bonferroni-protected tests.
Warner, R. M. (2012). Applied Statistics: From Bivariate Through Multivariate Techniques (2nd ed.). SAGE Publications. VitalBook file.
Chapter 8 - ALTERNATIVE CORRELATION COEFFICIENTS
8.1 Correlations for Different Types of Variables
Pearson correlation is generally introduced as a method to evaluate the strength of linear association between scores on two quantitative variables, an X predictor and a Y outcome variable. If the scores on X and Y are at least interval level of measurement and if the other assumptions for Pearson’s r are satisfied (e.g., X and Y are linearly related, X and Y have a bivariate normal distribution), then Pearson’s r is generally used to describe the strength of the linear relationship between variables. However, we also need to have indexes of correlation for pairs of variables that are not quantitative or that fall short of having equal-interval level of measurement properties or that have joint distributions that are not bivariate normal. This chapter discusses some of the more widely used alternative bivariate statistics that describe strength of association or correlation for different types of variables.
When deciding which index of association (or which type of correlation coefficient) to use, it is useful to begin by identifying the type of measurement for the X and Y variables. The X or independent variable and the Y or dependent variable may each be any of the following types of measurement:
1. Quantitative with interval/ratio measurement properties
2. Quantitative but only ordinal or rank level of measurement
3. Nominal or categorical with more than two categories
4. Nominal with just two categories
a. A true dichotomy b. An artificial dichotomy
Table 8.1 presents a few of the most widely used correlation statistics. For example, if both X and Y are quantitative and interval/ratio (and if the other assumptions for Pearson’s r are satisfied), the Pearson product-moment correlation (discussed in Chapter 7 ) is often used to describe the strength of linear association between scores on X and Y. If the scores come in the form of rank or ordinal data or if it is necessary to convert scores into ranks to get rid of problems such as severely nonnormal distribution shapes or outliers, then Spearman r or Kendall’s tau (τ) may be used. If scores on X correspond to a true dichotomy and scores on Y are interval/ratio level of measurement, the point biserial correlation may be used. If scores on X and Y both correspond to true dichotomies, the phi coefficient (Φ) can be reported. Details about computation and interpretation of these various types of correlation coefficients appear in the following sections of this chapter.
Some of the correlation indexes listed in Table 8.1, including Spearman r, point biserial r, and the phi coefficient, are equivalent to Pearson's r. For example, a Spearman r can be obtained by converting scores on X and Y into ranks (if they are not already in the form of ranks) and then computing Pearson's r for the ranked scores. A point biserial r can be obtained by computing Pearson's r between scores on a true dichotomous X variable (e.g., gender, coded 1 = female, 2 = male) and scores on a quantitative Y variable (such as heart rate, HR). (The use of gender or sex as an example of a true dichotomy could be questioned. Additional categories such as transsexual could be included in some studies.) A phi coefficient can be obtained by computing Pearson's r between scores on two true dichotomies (e.g., Does the person take a specific drug? 1 = no, 2 = yes; Does the person have a heart attack within 1 year? 1 = no, 2 = yes). Alternative computational formulas are available for Spearman r, point biserial r, and the phi coefficient, but the same numerical results can be obtained by applying the formula for Pearson's r. Thus, Spearman r, point biserial r, and the phi coefficient are equivalent to Pearson's r. Within SPSS, you obtain the same results when you use the Pearson's r procedure to compute a correlation between drug use and death (both coded as true dichotomies) as when you request a phi coefficient between drug use and death in the Crosstabs procedure. On the other hand, some of the other correlation statistics listed in Table 8.1 (such as the tetrachoric correlation rtet, biserial r, and Kendall's tau) are not equivalent to Pearson's r.
For many combinations of variables shown in Table 8.1, several different statistics can be reported as an index of association. For example, for two truly dichotomous variables, such as drug use and death, Table 8.1 lists the phi coefficient as an index of association, but it is also possible to report other statistics such as chi-square and Cramer's V, described in this chapter, or log odds ratios, described in Chapter 23 on binary logistic regression.
Later chapters in this textbook cover statistical methods that are implicitly or explicitly based on Pearson’s r values and covariances. For example, in multiple regression (in Chapters 11 and 14 ), the slope coefficients for regression equations can be computed based on sums of squares and sums of cross products based on the X and Y scores, or from the Pearson correlations among variables and the means and standard deviations of variables. For example, we could predict a person’s HR from that person’s scores on several different X predictors (X1 = gender, coded 1 = female, 2 = male; X2 = age in years; X3 = body weight in pounds):
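(The prediction equation itself is not reproduced in this copy; a raw-score regression equation for this example would have the general form shown below, where b0 is the intercept and b1, b2, and b3 are the regression slopes for gender, age, and body weight. The form is standard; the specific coefficient labels are supplied only for illustration.)

Predicted HR = b0 + b1X1 + b2X2 + b3X3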
Table 8.1 Widely Used Correlations for Various Types of Independent Variables (X) and Dependent Variables (Y) (Assuming That Groups Are Between Subjects or Independent)
a. There may be no purpose-designed statistic for some combinations of types of variables, but it is usually possible to downgrade your assessment of the level of measurement of one or both variables. For example, if you have an X variable that is interval/ratio and a Y variable that is ordinal, you could convert scores on X to ranks and use Spearman r. It might also be reasonable to apply Pearson’s r in this situation.
b. In practice, researchers do not always pay attention to the existence of artificial dichotomies when they select statistics. Tetrachoric r and biserial r are rarely reported.
When researchers use dichotomous variables (such as gender) as predictors in multiple regression, they implicitly assume that it makes sense to use Pearson’s r to index the strength of relationship between scores on gender and scores on the outcome variable HR. In this chapter, we will examine the point biserial r and the phi coefficient and demonstrate by example that they are equivalent to Pearson’s r. An important implication of this equivalence is that we can use true dichotomous variables (also called dummy variables; see Chapter 12 ) as predictors in regression analysis.
However, some problems can arise when we include dichotomous predictors in correlation-based analyses such as regression. In Chapter 7 , for example, it was pointed out that the maximum value of Pearson’s r, r = +1.00, can occur only when the scores on X and Y have identical distribution shapes. This condition is not met when we correlate scores on a dichotomous X variable such as gender with scores on a quantitative variable such as HR.
Many of the statistics included in Table 8.1 , such as Kendall’s tau, will not be mentioned again in later chapters of this textbook. However, they are included because you might encounter data that require these alternative forms of correlation analysis and they are occasionally reported in journal articles.
8.2 Two Research Examples
To illustrate some of these alternative forms of correlation, two small datasets will be used. The first dataset, which appears in Table 8.2 and Figure 8.1, consists of a hypothetical set of scores on a true dichotomous variable (gender) and a quantitative variable that has interval/ratio level of measurement properties (height). The relationship between gender and height can be assessed by doing an independent samples t test to compare means on height across the two gender groups (as described in Chapter 5). However, an alternative way to describe the strength of association between gender and height is to calculate a point biserial correlation, rpb, as shown in this chapter.
The second set of data comes from an actual study (Friedmann, Katcher, Lynch, & Thomas, 1980) in which 92 men who had a first heart attack were asked whether or not they owned a dog. Dog ownership is a true dichotomous variable, coded 0 = no and 1 = yes; this was used as a predictor variable. At the end of a 1-year follow-up, the researchers recorded whether each man had survived; this was the outcome or dependent variable. Thus, the outcome variable, survival status, was also a true dichotomous variable, coded 0 = no, did not survive and 1 = yes, did survive. The question in this study was whether survival status was predictable from dog ownership. The strength of association between these two true dichotomous variables can be indexed by several different statistics, including the phi coefficient; a test of statistical significance of the association between two nominal variables can be obtained by performing a chi-square (χ2) test of association. The data from the Friedmann et al. (1980) study appear in the form of a data file in Table 8.3 and as a summary table of observed cell frequencies in Table 8.4.
Similarities among the indexes of association (correlation indexes) covered in this chapter include the following:
Table 8.2 Data for the Point Biserial r Example: Gender (Coded 1 = Male and 2 = Female) and Height in Inches
1. The size of r (its absolute magnitude) provides information about the strength of association between X and Y. In principle, the range of possible values for the Pearson correlation is −1 ≤ r ≤ +1; however, in practice, the maximum possible values of r may be limited to a narrower range. Perfect correlation (either r = +1 or r = −1) is possible only when the X, Y scores have identical distribution shapes.
Figure 8.1 Scatter Plot for Relation Between a True Dichotomous Predictor (Gender) and a Quantitative Dependent Variable (Height)
NOTE: M1 = mean male height; M2 = mean female height.
When distribution shapes for X and Y differ, the maximum possible correlation between X and Y is often somewhat less than 1 in absolute value. For the phi coefficient, we can calculate the maximum possible value of phi given the marginal distributions of X and Y. Some indexes of association covered later in the textbook, such as the log odds ratios in Chapter 23 , are scaled quite differently and are not limited to values between −1 and +1.
2. For correlations that can have a plus or a minus sign, the sign of r provides information about the direction of association between scores on X and scores on Y. However, in many situations, the assignment of lower versus higher scores is arbitrary (e.g., gender, coded 1 = female, 2 = male), and in such situations, researchers need to be careful to pay attention to the codes that were used for categories when they interpret the sign of a correlation. Some types of correlation (such as η and Cramer’s V) have a range from 0 to +1—that is, they are always positive.
3. Some (but not all) of the indexes of association discussed in this chapter are equivalent to Pearson’s r.
Ways in which the indexes of association may differ:
1. The interpretation of the meaning of these correlations varies. Chapter 7 described two useful interpretations of Pearson's r. One involves the "mapping" of scores from zX to zY, or the prediction of a zY score from a zX score for each individual participant. A Pearson's r of 1 can occur only when there is an exact one-to-one correspondence between distances from the mean on X and distances from the mean on Y, and that in turn can happen only when X and Y have identical distribution shapes. A second useful interpretation of Pearson's r was based on the squared correlation (r2). A squared Pearson correlation can be interpreted as "the proportion of variance in Y scores that is linearly predictable from X," and vice versa. However, some of the other correlation indexes—even though they are scaled to have the same range from −1 to +1 as Pearson's r—have different interpretations.
Table 8.3 Dog Ownership/Survival Data
SOURCE: Friedmann et al. (1980).
NOTE: Prediction of survival status (true dichotomous variable) from dog ownership (true dichotomous variable): Dog ownership: 0 = no, 1 = yes. Survival status: 0 = no, did not survive; 1 = yes, did survive.
Table 8.4 Dog Ownership and Survival Status 1 Year After the First Heart Attack
SOURCE: Friedmann et al. (1980).
NOTE: The table shows the observed frequencies for outcomes in a survey study of N = 92 men who have had a first heart attack. The frequencies in the cells denoted by b and c represent concordant outcomes (b indicates answer “no” for both variables, c indicates answer “yes” for both variables). The frequencies denoted by a and d represent discordant outcomes (i.e., an answer of “yes” for one variable and “no” for the other variable). When calculating a phi coefficient by hand from the cell frequencies in a 2 × 2 table, information about the frequencies of concordant and discordant outcomes is used.
2. Some of the indexes of association summarized in this chapter are applicable only to very specific situations (such as 2 × 2 tables), while other indexes of association (such as the chi-square test of association) can be applied in a wide variety of situations.
3. Most of the indexes of association discussed in this chapter are symmetrical. For example, Pearson’s r is symmetrical because the correlation between X and Y is the same as the correlation between Y and X. However, there are some asymmetrical indexes of association (such as lambda and Somers’s d ). There are some situations where the ability to make predictions is asymmetrical; for example, consider a study about gender and pregnancy. If you know that an individual is pregnant, you can predict gender (the person must be female) perfectly. However, if you know that an individual is female, you cannot assume that she is pregnant. For further discussion of asymmetrical indexes of association, see Everitt (1977).
4. Some of the indexes of association for ordinal data are appropriate when there are large numbers of tied ranks; others are appropriate only when there are not many tied ranks. For example, problems can arise when computing Spearman r using the formula that is based on differences between ranks. Furthermore, indexes that describe strength of association between categorical variables differ in the way they handle tied scores; some statistics subtract the number of ties when they evaluate the numbers of concordant and discordant pairs (e.g., Kendall’s tau), while other statistics ignore cases with tied ranks.
8.3 Correlations for Rank or Ordinal Scores
Spearman r is applied in situations where the scores on X and Y are both in the form of ranks or in situations where the researcher finds it necessary or useful to convert X and Y scores into ranks to get rid of problems such as extreme outliers or extremely nonnormal distribution shapes. One way to obtain Spearman r, in by-hand computation, is as follows. First, convert scores on X into ranks. Then, convert scores on Y into ranks. If there are ties, assign the mean of the ranks for the tied scores to each tied score. For example, consider this set of X scores; the following example shows how ranks are assigned to scores, including tied ranks for the three scores equal to 25:
X      Rank of X: RX
30     1
28     2
25     (3 + 4 + 5)/3 = 4
25     (3 + 4 + 5)/3 = 4
25     (3 + 4 + 5)/3 = 4
24     6
20     7
12     8
For each participant, let di be the difference between ranks on the X and Y variables. The value of Spearman r, denoted by rs, can be found in either of two ways:
1. Compute the Pearson correlation between RX (rank on the X scores) and RY (rank on the Y scores).
2. Use the formula below to compute Spearman r (rs) from the differences in ranks:
where di = the difference between ranks = (RX − RY) and n = the number of pairs of (X, Y) scores or the number of di differences.
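(The formula image is missing from this copy; the standard rank-difference formula, with di and n as just defined, is:)

rs = 1 − (6∑di2) / [n(n2 − 1)]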
If rs = +1, there is perfect agreement between the ranks on X and Y; if rs = −1, the rank orders on X and Y are perfectly inversely related (e.g., the person with the highest score on X has the lowest score on Y).
Another index of association that can be used in situations where X and Y are either obtained as ranks or converted into ranks is Kendall’s tau; there are two variants of this called Kendall’s tau-b and Kendall’s tau-c . In most cases, the values of Kendall’s τ and Spearman r lead to the same conclusion about the nature of the relationship between X and Y. See Liebetrau (1983) for further discussion.
8.4 Correlations for True Dichotomies
Most introductory statistics books only show Pearson's r applied to pairs of quantitative variables. Generally, it does not make sense to apply Pearson's r in situations where X and/or Y are categorical variables with more than two categories. For example, it would not make sense to compute a Pearson correlation to assess whether the categorical variable, political party membership (coded 1 = Democrat, 2 = Republican, 3 = Independent, 4 = Socialist, etc.), is related to income level. The numbers used to indicate party membership serve only as labels and do not convey any quantitative information about differences among political parties. The mean income level could go up, go down, or remain the same as the X scores change from 1 to 2, 2 to 3, and so on; there is no reason to expect a consistent linear increase (or decrease) in income as the value of the code for political party membership increases.
However, when a categorical variable has only two possible values (such as gender, coded 1 = male, 2 = female, or survival status, coded 1 = alive, 0 = dead), we can use the Pearson correlation and related correlation indexes to relate them to other variables. To see why this is so, consider this example: X is gender (coded 1 = male, 2 = female); Y is height, a quantitative variable (hypothetical data appear in Table 8.2 , and a graph of these scores is shown in Figure 8.1 ). Recall that Pearson’s r is an index of the linear relationship between scores on two variables. When X is dichotomous, the only possible relation it can have with scores on a continuous Y variable is linear. That is, as we move from Group 1 to Group 2 on the X variable, scores on Y may increase, decrease, or remain the same. In any of these cases, we can depict the X, Y relationship by drawing a straight line to show how the mean Y score for X = 1 differs from the mean Y score for X = 2.
See Figure 8.1 for a scatter plot that shows how height (Y) is related to gender (X); clearly, mean height is greater for males (Group 1) than for females (Group 2). We can describe the relationship between height (Y) and gender (X) by doing an independent samples t test to compare mean Y values across the two groups identified by the X variable, or we can compute a correlation (either Pearson's r or a point biserial r) to describe how these variables are related. We shall see that the results of these two analyses provide equivalent information. By extension, it is possible to include dichotomous variables in some of the multivariable analyses covered in later chapters of this book. For example, when dichotomous variables are included as predictors in a multiple regression, they are usually called "dummy" variables (see Chapter 12). First, however, we need to consider one minor complication.
Pearson's r can be applied to dichotomous variables when they represent true dichotomies—that is, naturally occurring groups with just two possible outcomes. One common example of a true dichotomous variable is gender (coded 1 = male, 2 = female); another is survival status in a follow-up study of medical treatment (1 = patient survives; 0 = patient dies). However, sometimes we encounter artificial dichotomies. For instance, when we take a set of quantitative exam scores that range from 15 to 82 and impose an arbitrary cutoff (scores below 65 are a fail, scores of 65 and above are a pass), this type of dichotomy is "artificial." The researcher has lost some of the information about variability of scores by artificially converting them to a dichotomous group membership variable.
When a dichotomous variable is an artificially created dichotomy, there are special types of correlation; their computational formulas involve terms that attempt to correct for the information about variability that was lost in the artificial dichotomization. (Fitzsimons, 2008, has argued that researchers should never create artificial dichotomies because of the loss of information; however, some journal articles do report artificially dichotomized scores.) The correlation of an artificial dichotomy with a quantitative variable is called a biserial r (rb); the correlation between two artificial dichotomies is called a tetrachoric r (rtet). These are not examples of Pearson’s r; they use quite different computational procedures and are rarely used.
8.4.1 Point Biserial r (rpb)
If a researcher has data on a true dichotomous variable (such as gender) and a continuous variable (such as emotional intelligence, EI), the relationship between these two variables can be assessed by calculating a t test to assess the difference in mean EI for the male versus female groups or by calculating a point biserial r to describe the increase in EI scores in relation to scores on gender. The values of t and rpb are related, and each can easily be converted into the other; the two conversion formulas appear below. The t value can be compared with critical values of t to assess statistical significance, and the rpb value can be interpreted as a standardized index of effect size, or the strength of the relationship between group membership and scores on the outcome variable. In these equations, df = N − 2, where N is the total number of subjects. The sign of rpb can be determined by looking at the direction of change in Y across levels of X. This conversion between rpb and t is useful because t can be used to assess the statistical significance of rpb, and rpb (or its square) can be used as an index of the effect size associated with t.
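(Equations 8.3 and 8.4 themselves are not reproduced in this copy; the standard conversions between t and the point biserial correlation are:)

t = rpb√df / √(1 − rpb2) and rpb = √[t2 / (t2 + df)]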
To illustrate the correspondence between rpb and t, SPSS was used to run two different analyses on the hypothetical data shown in Figure 8.1. First, the independent samples t test was run to assess the significance of the difference of mean height for the male versus female groups (the procedures for running an independent samples t test using SPSS were presented in Chapter 5). The results are shown in the top two panels of Figure 8.2. The difference in mean height for males (M1 = 69.03) and females (M2 = 63.88) was statistically significant, t(66) = 9.69, p < .001. The mean height for females was about 5 in. lower than the mean height for males. Second, a point biserial correlation between height and gender was obtained by using the Pearson correlation procedure in SPSS: <Analyze> → <Correlate> → <Bivariate>. The results for this analysis are shown in the bottom panel of Figure 8.2; the correlation between gender and height was statistically significant, rpb(66) = −.77, p < .001. The nature of the relationship was that having a higher score on gender (i.e., being female) was associated with a lower score on height. The reader may wish to verify that when these values are substituted into Equations 8.3 and 8.4, the rpb value can be reproduced from the t value and the t value can be obtained from the value of rpb. Also, note that when η2 is calculated from the value of t as discussed in Chapter 5, η2 is equivalent to the square of rpb.
Figure 8.2 SPSS Output: Independent Samples t Test (Top) and Pearson’s r (Bottom) for Data in Figure 8.1
This demonstration is one of the many places in the book where readers will see that analyses that were introduced in different chapters in most introductory statistics textbooks turn out to be equivalent. This occurs because most of the statistics that we use in the behavioral sciences are special cases of a larger data analysis system called the general linear model. In the most general case, the general linear model may include multiple predictor and multiple outcome variables, and it can include one or more quantitative and dichotomous variables on the predictor side of the analysis and one or more quantitative or measured variables as outcome variables (Tatsuoka, 1993). Thus, when we predict a quantitative Y from a quantitative X variable, or a quantitative Y from a categorical X variable, these are special cases of the general linear model where we limit the number and type of variables on one or both sides of the analysis (the predictor and the dependent variable). (Note that there is a difference between the general linear model and the generalized linear model; only the general model, which corresponds to the SPSS general linear model or GLM procedure, is covered in this book.)
8.4.2 Phi Coefficient (Φ)
The phi coefficient (Φ) is the version of Pearson’s r that is used when both X and Y are true dichotomous variables. It can be calculated from the formulas given earlier for the general Pearson’s r using score values of 0 and 1, or 1 and 2, for the group membership variables; the exact numerical value codes that are used do not matter, although 0, 1 is the most conventional representation. Alternatively, phi can be computed from the cell frequencies in a 2 × 2 table that summarizes the number of cases for each combination of X and Y scores. Table 8.5 shows the way the frequencies of cases in the four cells of a 2 × 2 table are labeled to compute phi from the cell frequencies. Assuming that the cell frequencies a through d are as shown in Table 8.5 (i.e., a and d correspond to “discordant” outcomes and b and c correspond to “concordant” outcomes), here is a formula that may be used to compute phi directly from the cell frequencies:
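(Equation 8.5 itself is missing from this copy; the standard computational formula is:)

Φ = (b × c − a × d) / √[(a + b)(c + d)(a + c)(b + d)],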
where b and c are the number of cases in the concordant cells of a 2 × 2 table and a and d are the number of cases in the discordant cells of a 2 × 2 table.
In Chapter 7 , you saw that the Pearson correlation turned out to be large and positive when most of the points fell into the concordant regions of the X, Y scatter plot that appeared in Figure 7.16 (high values of X paired with high values of Y and low values of X paired with low values of Y). Calculating products of z scores was a way to summarize the information about score locations in the scatter plot and to assess whether most cases were concordant or discordant on X and Y. The same logic is evident in the formula to calculate the phi coefficient. The b × c product is large when there are many concordant cases; the a × d product is large when there are many discordant cases. The phi coefficient takes on its maximum value of +1 when all the cases are concordant (i.e., when the a and d cells have frequencies of 0). The Φ coefficient is 0 when b × c = a × d—that is, when there are as many concordant as discordant cases.
Table 8.5 Labels for Cell Frequencies in a 2 × 2 Contingency Table (a) as Shown in Most Textbooks and (b) as Shown in Crosstab Tables From SPSS
NOTES: Cases are called concordant if they have high scores on both X and Y or low scores on both X and Y. Cases are called discordant if they have low scores on one variable and high scores on the other variable. a = Number of cases with X low and Y high (discordant), b = number of cases with X high and Y high (concordant), c = number of cases with Y low and X low (concordant), and d = number of cases with X high and Y low (discordant). In textbook presentations of the phi coefficient, the 2 × 2 table is usually oriented so that values of X increase from left to right and values of Y increase from bottom to top (as they would in an X, Y scatter plot). However, in the Crosstabs tables produced by SPSS, the arrangement of the rows is different (values of Y increase as you read down the rows in an SPSS table). If you want to calculate a phi coefficient by hand from the cell frequencies that appear in the SPSS Crosstabs output, you need to be careful to look at the correct cells for information about concordant and discordant cases. In most textbooks, as shown in this table, the concordant cells b and c are in the major diagonal of the 2 × 2 table—that is, the diagonal that runs from lower left to upper right. In SPSS Crosstabs output, the concordant cells b and c are in the minor diagonal—that is, the diagonal that runs from upper left to lower right.
A formal significance test for phi can be obtained by converting it into a chi-square; in the following equation, N represents the total number of scores in the contingency table:
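(Equation 8.6 is not shown in this copy; the conversion is simply:)

χ2 = N × Φ2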
This is a chi-square statistic with 1 degree of freedom (df). Those who are familiar with chi-square from other statistics courses will recognize it as one of the many possible statistics to describe relationships between categorical variables based on tables of cell frequencies. For χ2 with 1 df and α = .05, the critical value of χ2 is 3.84; thus, if the obtained χ2 exceeds 3.84, then phi is statistically significant at the .05 level.
When quantitative X and Y variables have different distribution shapes, it limits the maximum possible size of the correlation between them because a perfect one-to-one mapping of score location is not possible when the distribution shapes differ. This issue of distribution shape also applies to the phi coefficient. If the proportions of yes/no or 0/1 codes on the X and Y variables do not match (i.e., if p1, the probability of a yes code on X, does not equal p2, the probability of a yes code on Y), then the maximum obtainable size of the phi coefficient may be much less than 1 in absolute value. This limitation on the magnitude of phi occurs because unequal marginal frequencies make it impossible to have 0s in one of the diagonals of the table (i.e., in a and d, or in b and c).
For example, consider the hypothetical research situation that is illustrated in Table 8.6 . Let’s assume that the participants in a study include 5 dead and 95 live subjects and 40 Type B and 60 Type A personalities and then try to see if it is possible to arrange the 100 cases into the four cells in a manner that results in a diagonal pair of cells with 0s in it. You will discover that it can’t be done. You may also notice, as you experiment with arranging the cases in the cells of Table 8.6 , that there are only six possible outcomes for the study—depending on the way the 5 dead people are divided between Type A and Type B personalities; you can have 0, 1, 2, 3, 4, or 5 Type A/dead cases, and the rest of the cell frequencies are not free to vary once you know the number of cases in the Type A/dead cell. 1
It is possible to calculate the maximum obtainable size of phi as a function of the marginal distributions of X and Y scores and to use this as a point of reference in evaluating whether the obtained phi coefficient was relatively large or small. The formula for Φmax is as follows:
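(Equation 8.7 is missing here; a form consistent with the worked example in Table 8.6 is shown below, where qj = 1 − pj and qi = 1 − pi.)

Φmax = √[(pi × qj) / (pj × qi)]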
That is, use the larger of the values p1 and p2 as pj in the formula above. For instance, if we correlate an X variable (coronary-prone personality, coded Type A = 1, Type B = 0, with a 60%/40% split) with a Y variable (death from heart attack, coded 1 = dead, 0 = alive, with a 5%/95% split), the maximum possible Φ that can be obtained in this situation is about .187 (see Table 8.6 ).
Table 8.6 Computation of Φmax for a Table With Unequal Marginals
NOTE: To determine what the maximum possible value of Φ is given these marginal probabilities, apply Equation 8.7 :
Because p1 (.60) > p2 (.05), we let pj = .60, qj = .40; pi = .05, qi = .95:
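(The worked display itself is not reproduced here; substituting these values gives:)

Φmax = √[(.05 × .40) / (.60 × .95)] = √(.02/.57) ≈ .187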
Essentially, Φmax is small when the marginal frequencies are unequal because there is no way to arrange the cases in the cells that would make the frequencies in both of the discordant cells equal zero. One reason why correlations between measures of personality and disease outcomes are typically quite low is that, in most studies, the proportions of persons who die, have heart attacks, or have other specific disease outcomes of interest are quite small. If a predictor variable (such as gender) has a 50/50 split, the maximum possible correlation between variables such as gender and heart attack may be quite small because the marginal frequency distributions for the variables are so different. This limitation is one reason why many researchers now prefer other ways of describing the strength of association, such as the odds ratios that can be obtained using binary logistic regression (see Chapter 23).
8.5 Correlations for Artificially Dichotomized Variables
Artificial dichotomies arise when researchers impose an arbitrary cutoff point on continuous scores to obtain groups; for instance, students may obtain a continuous score on an exam ranging from 1 to 100, and the teacher may impose a cutoff to determine pass/fail status (1 = pass, for scores of 70 and above; 0 = fail, for scores of 69 and below). Special forms of correlation may be used for artificial dichotomous scores (biserial r, usually denoted by rb, and tetrachoric r, usually denoted by rtet). These are rarely used; they are discussed only briefly here.
8.5.1 Biserial r (rb)
Suppose that the artificially dichotomous Y variable corresponds to a “pass” or “fail” decision. Let MXp be the mean of the quantitative X scores for the pass group and p be the proportion of people who passed; let MXq be the mean of the X scores for the fail group and q be the proportion of people who failed. Let h be the height of the normal distribution at the point where the pass/fail cutoff was set for the distribution of Y. Let sX be the standard deviation of all the X scores. Then,
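with MXp, MXq, p, q, h, and sX as just defined, the biserial correlation is commonly written in the form below (the original equation display is not reproduced in this copy):

rb = [(MXp − MXq) / sX] × (p × q / h)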
(from Lindeman et al., 1980, p. 74). Tables for the height of the normal distribution curve h are not common, and therefore this formula is not very convenient for by-hand computation.
8.5.2 Tetrachoric r (rtet)
Tetrachoric r is a correlation between two artificial dichotomies. The trigonometric functions included in these formulas provide an approximate adjustment for the information about the variability of scores that is lost when variables are artificially dichotomized; the two formulas discussed below are only approximations, and the exact formula involves an infinite series.
The cell frequencies are given in the following table:
where b and c are the concordant cases (the participant has a high score on X and a high score on Y, or a low score on X and a low score on Y); a and d are the discordant cases (the participant has a low score on X and a high score on Y, or a high score on X and a low score on Y), and n = the total number of scores, n = a + b + c + d.
If there is a 50/50 split between the number of 0s and the number of 1s on both the X and the Y variables (this would occur if the artificial dichotomies were based on median splits)—that is, if (a + b) = (c + d) and (a + c) = (b + d), then an exact formula for tetrachoric r is as follows:
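(The equation display is missing from this copy; the formula usually cited for this median-split case, sometimes called the cosine-pi formula, is shown below, with b and c the concordant cells and a and d the discordant cells as defined above.)

rtet = cos[180° / (1 + √(bc/ad))]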
However, if the split between the 0/1 groups is not made by a median split on one or both variables, a different formula provides an approximation for tetrachoric r that is a better approximation for this situation:
(from Lindeman et al., 1980, p. 79).
8.6 Assumptions and Data Screening for Dichotomous Variables
For a dichotomous variable, the closest approximation to a normal distribution would be a 50/50 split (i.e., half zeros and half ones). Situations where the group sizes are extremely unequal (e.g., 95% zeros and 5% ones) should be avoided for two reasons. First, when the absolute number of subjects in the smaller group is very low, the outcome of the analysis may be greatly influenced by scores for one or a few cases. For example, in a 2 × 2 contingency table, when a row has a very small total number of cases (such as five), then the set of cell frequencies in the entire overall 2 × 2 table will be entirely determined by the way the five cases in that row are divided between the two column categories. It is undesirable to have the results of your study depend on the behavior of just a few scores. When chi-square is applied to contingency tables, the usual rule is that no cell should have an expected cell frequency less than 5. A more appropriate analysis for tables where some rows or columns have very small Ns and some cells have expected frequencies less than 5 is the Fisher exact test (this is available through the SPSS Crosstabs procedure when the table is 2 × 2 in size). Second, the maximum possible value of the phi coefficient is constrained to be much smaller than +1 or –1 when the proportions of ones for the X and Y variables are far from equal.
8.7 Analysis of Data: Dog Ownership and Survival After a Heart Attack
Friedmann et al. (1980) reported results from a survey of patients who had a first heart attack. The key outcome of interest was whether or not the patient survived at least 1 year (coded 0 = no, 1 = yes). One of the variables they assessed was dog ownership (0 = no, 1 = yes). The results for this sample of 92 patients are shown in Table 8.4 . Three statistics will be computed for this table, to assess the relationship between pet ownership and survival: a phi coefficient computed from the cell frequencies in this table; a Pearson correlation calculated from the 0, 1 scores; and a chi-square test of significance. Using the formula in Equation 8.5 , the phi coefficient for the data in Table 8.4 is .310. The corresponding chi-square, calculated from Equation 8.6 , is 8.85. This chi-square exceeds the critical value of chi-square for a 1-df table (χ2 critical = 3.84), so we can conclude that there is a significant association between pet ownership and survival. Note that the phi coefficient is just a special case of Pearson’s r, so the value of the obtained correlation between pet ownership and survival will be the same whether it is obtained from the SPSS bivariate Pearson correlation procedure or as a Φ coefficient from the SPSS Crosstabs procedure.
Although it is possible to calculate chi-square by using the values of Φ and N, it is also instructive to consider another method for the computation of chi-square, a method based on the sizes of the discrepancies between observed frequencies and expected frequencies that are based on a null hypothesis that the row and column variables are not related. We will reanalyze the data in Table 8.4 and compute chi-square directly from the cell frequencies. Our notation for the observed frequency of scores in each cell will be O; the expected cell frequency for each cell is denoted by E. The expected cell frequency is the number of observations that are expected to fall in each cell under the null hypothesis that the row and column variables are independent. These expected values for E are generated from a simple model that tells us what cell frequencies we would expect to see if the row and column variables were independent.
First, we need to define independence between events A (such as owning a dog) and B (surviving 1 year after a heart attack). If Pr(A) = Pr(A|B)—that is, if the unconditional probability of A is the same as the conditional probability of A given B, then A and B are independent. Let’s look again at the observed frequencies given in Table 8.4 for pet ownership and coronary disease patient survival. The unconditional probability that any patient in the study will be alive at the end of 1 year is denoted by Pr(alive at the end of 1 year) and is obtained by dividing the number of persons alive by the total N in the sample; this yields 78/92 or .85. In the absence of any other information, we would predict that any randomly selected patient has about an 85% chance of survival. Here are two of the conditional probabilities that can be obtained from this table. The conditional probability of surviving 1 year for dog owners is denoted by Pr(survived 1 year|owner of dog); it is calculated by taking the number of dog owners who survived and dividing by the total number of dog owners, 50/53, which yields .94. This is interpreted as a 94% chance of survival for dog owners. The conditional probability of survival for nonowners is denoted by Pr(survived 1 year|nonowner of dog); it is calculated by taking the number of dog nonowners who survived and dividing by the total number of nonowners of dogs; this gives 28/39 or .72—that is, a 72% chance of survival for nonowners of dogs. If survival were independent of dog ownership, then these three probabilities should all be equal: Pr(alive|owner) = Pr(alive|nonowner) = Pr(alive). For this set of data, these three probabilities are not equal. In fact, the probability of surviving for dog owners is higher than for nonowners and higher than the probability of surviving in general for all persons in the sample. We need a statistic to help us evaluate whether this difference between the conditional and unconditional probabilities is statistically significant or whether it is small enough to be reasonably attributed to sampling error. In this case, we can evaluate significance by setting up a model of the expected frequencies we should see in the cells if ownership and survival were independent.
For each cell, the expected frequency, E—the number of cases that would be in that cell if group membership on the row and column variables were independent—is obtained by taking (Row total × Column total)/Table total N. For instance, for the dog owner/alive cell, the expected frequency E = (Number of dog owners × Number of survivors)/Total N in the table. Another way to look at this computation for E is that E = Column total × (Row total/Total N); that is, we take the total number of cases in a column and divide it so that the proportion of cases in Row 1 equals the proportion of cases in Row 2. For instance, the expected number of dog owners who survive 1 year if survival is independent of ownership = Total number of dog owners × Proportion of all people who survive 1 year = 53 × (78/92) = 44.9. That is, we take the 53 dog owners and divide them into the same proportions of survivors and nonsurvivors as in the overall table. These expected frequencies, E, for the dog ownership data are summarized in Table 8.7 .
Note that the Es (expected cell frequencies if H0 is true and the variables are not related) in Table 8.7 sum to the same marginal frequencies as the original data in Table 8.4 . All we have done is reapportion the frequencies into the cells in such a way that Pr(A) = Pr(A|B). That is, for this table of Es,
Pr(survived 1 year) = 78/92 = .85,
Pr(survived 1 year|owner) = 44.9/53 = .85, and
Pr(survived 1 year|nonowner) = 33.1/39 = .85.
In other words, if survival is not related to ownership of a dog, then the probability of survival should be the same in the dog owner and non-dog-owner groups, and we have figured out what the cell frequencies would have to be to make those probabilities or proportions equal.
Table 8.7 Expected Cell Frequencies (If Dog Ownership and Survival Status Are Independent) for the Data in Tables 8.3 and 8.4
Next we compare the Es (the frequencies we would expect if owning a dog and survival are independent) and Os (the frequencies we actually obtained in our sample). We want to know if our actually observed frequencies are close to the ones we would expect if H0 were true; if so, it would be reasonable to conclude that these variables are independent. If Os are very far from Es, then we can reject H0 and conclude that there is some relationship between these variables. We summarize the differences between Os and Es across cells by computing the following statistic:
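(Equation 8.11 is not reproduced in this copy; the statistic is the familiar chi-square, summed over all cells of the table:)

χ2 = ∑[(O − E)2 / E]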
Note that the (O − E) deviations sum to zero within each row and column, which means that once you know the (O − E) deviation for the first cell, the other three (O − E) values in this 2 × 2 table are not free to vary. In general, for a table with r rows and c columns, the number of independent deviations (O − E) = (r − 1)(c − 1), and this is the df for the chi-square. For a 2 × 2 table, df = 1.
In this example,
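substituting the observed frequencies from Table 8.4 and the expected frequencies from Table 8.7 (using unrounded expected values; the worked equation display itself is not reproduced in this copy) gives

χ2 = (50 − 44.93)2/44.93 + (3 − 8.07)2/8.07 + (28 − 33.07)2/33.07 + (11 − 5.93)2/5.93 ≈ 8.85.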
This obtained value agrees with the value of chi-square that we computed earlier from the phi coefficient, and it exceeds the critical value of chi-square that cuts off 5% in the right-hand tail for the 1-df distribution of χ2 (critical value = 3.84). Therefore, we conclude that there is a statistically significant relation between these variables, and the nature of the relationship is that dog owners have a significantly higher probability of surviving 1 year after a heart attack (about 94%) than nonowners of dogs (72%). Survival is not independent of pet ownership; in fact, in this sample, there is a significantly higher rate of survival for dog owners.
The most widely reported effect size for the chi-square test of association is Cramer’s V. Cramer’s V can be calculated for contingency tables with any number of rows and columns. For a 2 × 2 table, Cramer’s V is equal to the absolute value of phi. Values of Cramer’s V range from 0 to 1 regardless of table size (but only if the row marginal totals equal the column marginal totals). Values close to 0 indicate no association; values close to 1 indicate strong association:
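(The formula display is omitted in this copy; Cramer's V is computed as:)

V = √[χ2 / (n × m)],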
where chi-square is computed from Equation 8.11 , n is the total number of scores in the sample, and m is the minimum of (Number of rows – 1), (Number of columns – 1).
The statistical significance of Cramer’s V can be assessed by looking at the associated chi-square; Cramer’s V can be reported as effect-size information for a chi-square analysis. Cramer’s V is a symmetrical index of association ; that is, it does not matter which is the independent variable.
Chi-square goodness-of-fit tests can be applied to 2 × 2 tables (as in this example); they can also be applied to contingency tables with more than two rows or columns. Chi-square also has numerous applications later in statistics as a generalized goodness-of-fit test. 2 Although chi-square is commonly referred to as a "goodness-of-fit" test, note that the higher the chi-square value, in general, the worse the agreement between the expected values generated by the model and the observed data. When chi-square is applied to contingency tables, the expected frequencies generated by the model correspond to the null hypothesis that the row and column variables are independent. Therefore, a chi-square large enough to be judged statistically significant is a basis for rejection of the null hypothesis that group membership on the row variable is unrelated to group membership on the column variable.
When chi-square results are reported, the write-up should include the following:
1. A table that shows the observed cell frequencies and either row or column percentages (or both).
2. The obtained value of chi-square, its df, and whether it is statistically significant.
3. An assessment of the effect size; this can be phi, for a 2 × 2 table; other effect sizes such as Cramer’s V are used for larger tables.
4. A statement about the nature of the relationship, stated in terms of differences in proportions or of probabilities. For instance, in the pet ownership example, the researchers could say that the probability of surviving for 1 year is much higher for owners of dogs than for people who do not own dogs.
8.8 Chi-Square Test of Association (Computational Methods for Tables of Any Size)
The method of computation for chi-square described in the preceding section can be generalized to contingency tables with more than two rows and columns. Suppose that the table has r rows and c columns. For each cell, the expected frequency, E, is computed by multiplying the corresponding row and column total Ns and dividing this product by the N of cases in the entire table. For each cell, the deviation between O (observed) and E (expected) frequencies is calculated, squared, and divided by E (expected frequency for that cell). These terms are then summed across the r × c cells. For a 2 × 3 table, for example, there are six cells and six terms included in the computation of chi-square. The df for the chi-square test on an r × c table is calculated as follows:
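(The equation display is missing here; the degrees of freedom are:)

df = (r − 1) × (c − 1)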
Thus, for instance, the degrees of freedom for a 2 × 3 table is (2 − 1) × (3 − 1) = 2 df. Only the first two (O – E) differences between observed and expected frequencies in a 2 × 3 table are free to vary. Once the first two deviations are known, the remaining deviations are determined because of the requirement that (O – E) sum to zero down each column and across each row of the table. Critical values of chi-square for any df value can be found in the table in Appendix D .
8.9 Other Measures of Association for Contingency Tables
Until about 10 years ago, the most widely reported statistic for the association between categorical variables in contingency tables was the chi-square test of association (sometimes accompanied by Φ or Cramer’s V as effect-size information). The chi-square test of association is still fairly widely reported. However, many research situations involve prediction of outcomes that have low base rates (e.g., fewer than 100 out of 10,000 patients in a medical study may die of coronary heart disease). The effect-size indexes most commonly reported for chi-square, such as phi, are constrained to be less than +1.00 when the marginal distribution for the predictor variable differs from the marginal distribution of the outcome variable; for instance, in some studies, about 60% of patients have the Type A coronary-prone personality, but only about 5% to 10% of the patients develop heart disease. Because the marginal distributions (60/40 split on the personality predictor variable vs. 90/10 or 95/5 split on the outcome variable) are so different, the maximum possible value of phi or Pearson’s r is restricted; even if there is a strong association between personality and disease, phi cannot take on values close to +1 when the marginal distributions of the X and Y variables are greatly different. In such situations, effect-size measures, such as phi, that are marginal dependent may give an impression of effect size that is unduly pessimistic.
Partly for this reason, different descriptions of association are often preferred in clinical studies; in recent years, odds ratios have become the most popular index of the strength of association between a risk factor (such as smoking) and a disease outcome (such as lung cancer) or between a treatment and an outcome (such as survival). Odds ratios are usually obtained as part of a binary logistic regression analysis. A brief definition of odds ratios is provided in the glossary, and a more extensive explanation of this increasingly popular approach to summarizing information from 2 × 2 tables that correlate risk and outcome (or intervention and outcome) is provided in Chapter 23 .
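For readers unfamiliar with odds ratios, the brief Python sketch below computes one directly from a hypothetical 2 × 2 risk/outcome table; the counts are invented for illustration, and in practice, as noted above, odds ratios are usually obtained from a binary logistic regression.

# Hypothetical 2 x 2 table of counts:
#                 disease    no disease
#   exposed           30          70
#   not exposed       10          90

a, b = 30, 70     # exposed:     disease, no disease
c, d = 10, 90     # not exposed: disease, no disease

odds_exposed = a / b          # odds of disease given exposure
odds_unexposed = c / d        # odds of disease without exposure
odds_ratio = odds_exposed / odds_unexposed   # equals (a*d) / (b*c)
print(round(odds_ratio, 2))   # about 3.86: exposure raises the odds of disease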
In addition, dozens of other statistics may be used to describe the patterns of scores in contingency tables. Some of these statistics are applicable only to tables that are 2 × 2; others can be used for tables with any number of rows and columns. Some of these statistics are marginal dependent, while others are not dependent on the marginal distributions of the row and column variables. Some of these are symmetric indexes, while others (such as lambda and Somers’s d) are asymmetric; that is, they show a different reduction in uncertainty for prediction of Y from X than for prediction of X from Y. The McNemar test is used when a contingency table corresponds to repeated measures—for example, participant responses on a binary outcome variable before versus after an intervention. A full review of these many contingency table statistics is beyond the scope of this book; see Everitt (1977) and Liebetrau (1983) for more comprehensive discussion of contingency table analysis.
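As an illustration of the repeated-measures case, the short sketch below computes the uncorrected McNemar statistic by hand from hypothetical before/after counts; only the two discordant cells enter the test.

from scipy import stats

# Hypothetical paired binary responses (before vs. after an intervention):
#                     after = 1   after = 0
#   before = 1             40          25
#   before = 0             10          45
b, c = 25, 10                            # the two discordant cells
mcnemar_chi2 = (b - c) ** 2 / (b + c)    # uncorrected McNemar statistic, 1 df
p_value = stats.chi2.sf(mcnemar_chi2, df=1)
print(round(mcnemar_chi2, 2), round(p_value, 3))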
8.10 SPSS Output and Model Results Write-Up
Two SPSS programs were run on the data in Table 8.4 to verify that the numerical results obtained by hand earlier were correct. The SPSS Crosstabs procedure was used to compute phi and chi-square (this program also reports numerous other statistics for contingency tables). The SPSS bivariate correlation procedure (as described earlier in Chapter 7 ) was also applied to these data to obtain a Pearson’s r value.
To enter the dog owner/survival data into SPSS, one column was used to represent each person’s score on dog ownership (coded 0 = did not own dog, 1 = owned dog), and a second column was used to enter each person’s score on survival (0 = did not survive for 1 year after heart attack, 1 = survived for at least 1 year). The number of lines with scores of 1, 1 in this dataset corresponds to the number of survivors who owned dogs. The complete set of data for this SPSS example appears in Table 8.3 .
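Readers working outside SPSS can reproduce this data layout with a few lines of Python (using pandas, which is an assumption of this sketch, not part of the text's SPSS workflow); the cell counts of 50, 3, 28, and 11 are those reported for Table 8.4 in the Results section below.

import pandas as pd

# One row per person, with 0/1 scores for dog ownership and survival;
# the counts 50, 3, 28, and 11 are taken from the text (Table 8.4).
rows = ([(1, 1)] * 50 +   # owned dog, survived
        [(1, 0)] * 3 +    # owned dog, did not survive
        [(0, 1)] * 28 +   # no dog, survived
        [(0, 0)] * 11)    # no dog, did not survive
data = pd.DataFrame(rows, columns=["dog_owner", "survived"])

# Reproduce the contingency table that Crosstabs builds from these scores
print(pd.crosstab(data["dog_owner"], data["survived"], margins=True))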
The SPSS menu selections to run the Crosstabs procedure were as follows (from the top-level menu, make these menu selections, as shown in Figure 8.3 ): <Analyze> → <Descriptive Statistics> → <Crosstabs>.
This opens the SPSS dialog window for the Crosstabs procedure, shown in Figure 8.4. The names of the row and column variables were placed in the appropriate windows. In this example, the row variable corresponds to the score on the predictor variable (dog ownership), and the column variable corresponds to the score on the outcome variable (survival status). The Statistics button was clicked to access the menu of optional statistics to describe the pattern of association in this table, as shown in Figure 8.5. The optional statistics selected included chi-square, phi, and Cramer's V. The Cells button in the main Crosstabs dialog window was then used to open the Crosstabs Cell Display menu, which appears in Figure 8.6. In addition to the observed frequency for each cell, the expected frequencies and row percentages were requested.
The output from the Crosstabs procedure for these data appears in Figure 8.7 . The first panel shows the contingency table with observed and expected cell frequencies and row percentages. The second panel reports the obtained value of χ2 (8.85) and some additional tests. The third panel in Figure 8.7 reports the symmetric measures of association that were requested, including the value of Φ (.310) and that of Cramer’s V (also .310).
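These values can be cross-checked outside SPSS. The sketch below applies SciPy's chi2_contingency (with the continuity correction turned off, to match the uncorrected Pearson chi-square) to the same cell counts; the choice of SciPy is an assumption of this sketch rather than part of the text's workflow.

import numpy as np
from scipy import stats

observed = np.array([[11, 28],    # no dog:    did not survive, survived
                     [3, 50]])    # owned dog: did not survive, survived
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

n = observed.sum()
phi = np.sqrt(chi2 / n)           # for a 2 x 2 table, phi = sqrt(chi-square / N)
print(round(chi2, 2), dof, round(phi, 3))   # approximately 8.85, 1, .310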
Figure 8.3 Menu Selections for Crosstabs Procedure
Figure 8.4 SPSS Crosstabs Main Dialog Window
In addition, a Pearson correlation was calculated for the scores on dog ownership and survival status, using the same procedure as in Chapter 7 to obtain a correlation: <Analyze> → <Correlation> → <Bivariate>. Pearson’s r (shown in Figure 8.8) is .310; this is identical to the value reported for phi using the Crosstabs procedure above.
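The equivalence of Pearson's r and phi for 0/1 scores is easy to verify directly; the brief sketch below correlates the two 0/1 variables built from the same cell counts reported in the Results section.

import numpy as np

# 53 dog owners (50 survived, 3 did not) and 39 nonowners (28 survived, 11 did not)
dog_owner = np.array([1] * 53 + [0] * 39)
survived = np.array([1] * 50 + [0] * 3 + [1] * 28 + [0] * 11)

r = np.corrcoef(dog_owner, survived)[0, 1]
print(round(r, 3))   # .310, the same value as phi from the Crosstabs output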
Figure 8.5 SPSS Crosstabs Statistics Dialog Window
Figure 8.6 SPSS Crosstabs Cell Display Dialog Window
Figure 8.7 SPSS Output From Crosstabs Procedure for Dog/Survival Status Data in Tables 8.3 and 8.4
Figure 8.8 SPSS Output From Pearson Correlation Procedure for Dog/Survival Status Data in Tables 8.3 and 8.4
Results
A survey was done to assess numerous variables that might predict survival for 1 year after a first heart attack; there were 92 patients in the study. Only one predictor variable is reported here: dog ownership. Expected cell frequencies were examined to see whether there were any expected frequencies less than 5; the smallest expected cell frequency was 5.9. (If there were one or more cells with expected frequencies less than 5, it would be preferable to report the Fisher exact test rather than chi-square.) Table 8.4 shows the observed cell frequencies for dog ownership and survival status. Of the 53 dog owners, 3 did not survive; of the 39 nonowners of dogs, 11 did not survive. A phi coefficient was calculated to assess the strength of this relationship: Φ = .310. This corresponds to a medium-size effect. This was a statistically significant association: χ2(1) = 8.85, p < .05. This result was also statistically significant by the Fisher exact test, p = .006. The nature of the relationship was that dog owners had a significantly higher proportion of survivors (94%) than non–dog owners (72%). Because this study was not experimental, it is not possible to make a causal inference.
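The remaining numbers in this paragraph (the 94% and 72% survival proportions and the Fisher exact test) can likewise be checked with a few lines of Python; scipy.stats.fisher_exact is used here as a stand-in for the SPSS output, which is an assumption of this sketch.

from scipy import stats

survived_dog, died_dog = 50, 3
survived_nodog, died_nodog = 28, 11

print(round(survived_dog / (survived_dog + died_dog), 2))        # about .94 for dog owners
print(round(survived_nodog / (survived_nodog + died_nodog), 2))  # about .72 for nonowners

table = [[survived_dog, died_dog],
         [survived_nodog, died_nodog]]
odds_ratio, p_value = stats.fisher_exact(table)   # two-sided by default
print(round(p_value, 3))                          # the text reports p = .006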
8.11 Summary
This chapter provided information about different forms of correlation that are appropriate when X and Y are rank/ordinal or when one or both of these variables are dichotomous. This chapter demonstrated that Pearson’s r can be applied in research situations where one or both of the variables are true dichotomies. This is important because it means that true dichotomous variables may be used in many other multivariate analyses that build on variance partitioning and use covariance and correlation as information about the way variables are interrelated.
The chi-square test of association for contingency tables was presented in this chapter as a significance test that can be used to evaluate the statistical significance of the phi correlation coefficient. However, chi-square tests have other applications, and it is useful for students to understand the chi-square as a general goodness-of-fit test; for example, chi-square is used as one of the numerous goodness-of-fit tests for structural equation models.
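A minimal goodness-of-fit illustration, separate from the SEM application and using hypothetical counts, is sketched below with scipy.stats.chisquare; the counts and the uniform expected distribution are assumptions made only for this example.

from scipy import stats

observed = [18, 25, 32, 25]   # hypothetical observed counts in four categories
expected = [25, 25, 25, 25]   # counts expected under the hypothesized model

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(round(chi2, 2), round(p, 3))   # a large chi-square (small p) would indicate poor fit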
This chapter described only a few widely used statistics that can be applied to contingency tables. There are many other possible measures of association for contingency tables; for further discussion, see Everitt (1977) or Liebetrau (1983). Students who anticipate that they will do a substantial amount of research using dichotomous outcome variables should refer to Chapter 23 in this book for an introductory discussion of binary logistic regression; logistic regression is presently the most widely used analysis for this type of data. For categorical outcome variables with more than two categories, polytomous logistic regression can be used (Menard, 2001). In research situations where there are several categorical predictor variables and one categorical outcome variable, log linear analysis is often reported.
Notes
1. In other words, conclusions about the outcome of this study depend entirely on the outcomes for these five individuals, regardless of the size of the total N for the table (and it is undesirable to have a study where a change in outcome for just one or two participants can greatly change the nature of the outcome).
2. There are other applications of chi-square apart from its use to evaluate the association between row and column variables in contingency tables. For example, in structural equation modeling (SEM), chi-square tests are performed to assess how much the variance/covariance matrix that is reconstructed from SEM parameters differs from the original variance/covariance matrix calculated from the scores. A large chi-square for an SEM model is interpreted as evidence that the model is a poor fit—that is, the model does not do a good job of reconstructing the variances and covariances.
Comprehension Questions
1. How are point biserial r (rpb) and the phi coefficient different from Pearson's r?
2. How are biserial r (rb) and tetrachoric r (rtet) different from Pearson's r?
3. Is high blood pressure diagnosis (defined as high blood pressure = 1 = systolic pressure equal to or greater than 140 mm Hg, low blood pressure = 0 = systolic pressure less than 140 mm Hg) a true dichotomy or an artificial dichotomy?
4. The data in the table below were collected in a famous social-psychological field experiment. The researchers examined a common source of frustration for drivers: a car stopped at a traffic light that fails to move when the light turns green. The variable they manipulated was the status of the frustrating car (1 = high status, expensive, new; 0 = low status, inexpensive, old). They ran repeated trials in which they stopped at a red light, waited for the light to turn green, and then did not move the car; they observed whether the driver in the car behind them honked or not (1 = honked, 0 = did not honk). They predicted that people would be more likely to honk at low-status cars than at high-status cars (Doob & Gross, 1968). This table reports part of their results:
a. Calculate phi and chi-square by hand for the table above, and write up a Results section that describes your findings and notes whether the researchers' prediction was upheld.
b. Enter the data for this table into SPSS. To do this, create one variable in the SPSS worksheet that contains scores of 0 or 1 for the variable status and another variable in the SPSS worksheet that contains scores of 0 or 1 for the variable honking (e.g., because there were 18 people who honked at a high-status car, you will enter 18 lines with scores of 1 on the first variable and 1 on the second variable).
c. Using SPSS, do the following: Run the Crosstabs procedure and obtain both phi and chi-square; also, run a bivariate correlation (and note how the obtained bivariate correlation compares with your obtained phi).
d. In this situation, given the marginal frequencies, what is the maximum possible value of phi?
e. The researchers manipulated the independent variable (status of the car) and were careful to control for extraneous variables. Can they make a causal inference from these results? Give reasons for your answer.
5. When one or both of the variables are dichotomous, Pearson's r has specific names; for example, when a true dichotomy is correlated with a quantitative variable, what is this correlation called? When two true dichotomous variables are correlated, what is this correlation called?
6. What information should be included in the report of a chi-square test of contingency?
7. The table below gives the percentage of people who were saved (vs. lost) when the Titanic sank. The table provides information divided into groups by class (first class, second class, third class, and crew) and by gender and age (children, women, men).
Titanic Disaster—Official Casualty Figures
SOURCE: British Parliamentary Papers, Shipping Casualties (Loss of the Steamship 'Titanic'), 1912, cmd. 6352, 'Report of a Formal Investigation into the circumstances attending the foundering on the 15th April, 1912, of the British Steamship "Titanic," of Liverpool, after striking ice in or near Latitude 41° 46' N., Longitude 50° 14' W., North Atlantic Ocean, whereby loss of life ensued' (London: His Majesty's Stationery Office, 1912), page 42.
The information in the table is sufficient to set up some simple chi-square tests.
For example, let's ask: Was there a difference in the probability of being saved for women passengers in first class versus women passengers in third class? There were a total of 309 women in first and third class. The relevant numbers from the table on page 336 appear in the table below.
Compute a phi coefficient using the observed cell frequencies in the table above.
Also, compute a chi-square statistic for the observed frequencies in the table above. Write up your results in paragraph form.
Was there a statistically significant association between being in first class and being saved when we look at the passenger survival data from the Titanic? How strong was the association between class and outcome (e.g., how much more likely were first-class women passengers to be saved than were third-class women passengers)?
