The hsb2 data set was used in this example, and the code used is given below. We first show the entire output; then we break the output into pieces and explain each part.
proc corr data = "D:\hsb2"; var read write math science female; run;
The CORR Procedure
5 Variables: read write math science female
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum Label
read 200 52.23000 10.25294 10446 28.00000 76.00000 reading score write 200 52.77500 9.47859 10555 31.00000 67.00000 writing score math 200 52.64500 9.36845 10529 33.00000 75.00000 math score science 200 51.85000 9.90089 10370 26.00000 74.00000 science score female 200 0.54500 0.49922 109.00000 0 1.00000
Pearson Correlation Coefficients, N = 200 Prob > |r| under H0: Rho=0
read write math science female
read 1.00000 0.59678 0.66228 0.63016 -0.05308 reading score <.0001 <.0001 <.0001 0.4553
write 0.59678 1.00000 0.61745 0.57044 0.25649 writing score <.0001 <.0001 <.0001 0.0002
math 0.66228 0.61745 1.00000 0.63073 -0.02934 math score <.0001 <.0001 <.0001 0.6801
science 0.63016 0.57044 0.63073 1.00000 -0.12774 science score <.0001 <.0001 <.0001 0.0714
female -0.05308 0.25649 -0.02934 -0.12774 1.00000 0.4553 0.0002 0.6801 0.0714
Summary statistics
5 Variablesa: read write math science female
Simple Statistics
Variablea Nb Meanc Std Devd Sume Minimumf Maximumf Labelg
read 200 52.23000 10.25294 10446 28.00000 76.00000 reading score write 200 52.77500 9.47859 10555 31.00000 67.00000 writing score math 200 52.64500 9.36845 10529 33.00000 75.00000 math score science 200 51.85000 9.90089 10370 26.00000 74.00000 science score female 200 0.54500 0.49922 109.00000 0 1.00000
a. Variable – This gives the list of variables that were used to create the correlation matrix. This is the same list as that on the var statement in proc corr code above.
b. N – This is the number of valid (i.e., non-missing) cases used in the correlation. In this example, all 200 students had scores for all tests. By default, proc corr uses pairwise deletion for missing observations, meaning that a pair of observations (one from each variable in the pair being correlated) is included if both values are non-missing. If you use the nomiss option on the proc corr statement, proc corr uses listwise deletion and omits all observations with missing data on any of the named variables.
c. Mean – This is the mean (or average) of the variable.
d. Std Dev – This is the standard deviation of the variable.
e. Sum – This is the sum of the variable. This is the value obtained if you added up all of the values for that variable.
f. Minimum and Maximum – These are the smallest and largest values of the variable, respectively.
g. Label – This is the label of the variable (the variable label). Variable labels are a form of data documentation and usually provide additional information about what the variable is.
The correlation matrix
Pearson Correlation Coefficientsh, N = 200i Prob > |r| under H0: Rho=0j
read write math science female
read 1.00000 0.59678 0.66228 0.63016 -0.05308 reading score <.0001 <.0001 <.0001 0.4553
write 0.59678 1.00000 0.61745 0.57044 0.25649 writing score <.0001 <.0001 <.0001 0.0002
math 0.66228 0.61745 1.00000 0.63073 -0.02934 math score <.0001 <.0001 <.0001 0.6801
science 0.63016 0.57044 0.63073 1.00000 -0.12774 science score <.0001 <.0001 <.0001 0.0714
female -0.05308 0.25649 -0.02934 -0.12774 1.00000 0.4553 0.0002 0.6801 0.0714
h. Pearson Correlation Coefficients – These numbers measure the strength and direction of the linear relationship between the two variables. The correlation coefficient can range from -1 to +1, with -1 indicating a perfect negative correlation, +1 indicating a perfect positive correlation, and 0 indicating no correlation at all. (A variable correlated with itself will always have a correlation coefficient of 1.) You can think of the correlation coefficient as telling you the extent to which you can guess the value of one variable given a value of the other variable. From the scatterplot of the variables read and write below, we can see that the points tend along a line going from the bottom left to the upper right, which is the same as saying that the correlation is positive. The .59678 is the numerical description of how tightly around the imaginary line the points lie. If the correlation was higher, the points would tend to be closer to the line; if it was smaller, they would tend to be further away from the line.
i. N = 200 – This indicates that 200 observations were used in the correlation of each pair of variables.
j. Prob > |r| under H0: Rho=0 – This is the p-value and indicates the probability of observing this correlation coefficient or one more extreme under the null hypothesis (H0) that the correlation (Rho) is 0.
NOTE: The heading for this section is constructed in this way so that you know that the top number is the correlation coefficient and the bottom number is the p-value. Also, you can use either continuous or dichotomous (e.g., 0/1) variables in a Pearson correlation, but you should not use multi-level categorical variables, for example, four categories of type of car. The correlation coefficient can be misleading if the range of the variable is restricted. For example, if the science test was too easy for most students, the upper range of the scale would be restricted and the correlation coefficient would not reflect the true correlation between science and the other variables.
Scatterplot
Below we show a scatterplot, which is the graphical version of a correlation. You can make a scatterplot matrix just like you can make a correlation matrix. This graph shows you the strength and direction of the relationship between the two variables just like the correlation coefficient.
proc gplot data = "D:\hsb2"; plot read*write; run; quit;