Proc freq | SAS Annotated Output

Below we show the SAS code and the output for proc freq, using the hsb2 data set. We request a two-way table of a three-level categorical variable (ses) by a two-level categorical variable (female). Remember that you generally do not want to use a continuous variable in proc freq: every distinct value of the variable gets its own row or column, so the output can become very long.

proc freq data = "D:\hsb2";
tables ses*female / expected chisq;
run;

The FREQ Procedure

Table of ses by female

ses       female

Frequency|
Expected |
Percent  |
Row Pct  |
Col Pct  |       0|       1|  Total
---------+--------+--------+
       1 |     15 |     32 |     47
         | 21.385 | 25.615 |
         |   7.50 |  16.00 |  23.50
         |  31.91 |  68.09 |
         |  16.48 |  29.36 |
---------+--------+--------+
       2 |     47 |     48 |     95
         | 43.225 | 51.775 |
         |  23.50 |  24.00 |  47.50
         |  49.47 |  50.53 |
         |  51.65 |  44.04 |
---------+--------+--------+
       3 |     29 |     29 |     58
         |  26.39 |  31.61 |
         |  14.50 |  14.50 |  29.00
         |  50.00 |  50.00 |
         |  31.87 |  26.61 |
---------+--------+--------+
Total          91      109      200
            45.50    54.50   100.00

Statistics for Table of ses by female

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     2      4.5765    0.1014
Likelihood Ratio Chi-Square    2      4.6789    0.0964
Mantel-Haenszel Chi-Square     1      3.1098    0.0778
Phi Coefficient                       0.1513
Contingency Coefficient               0.1496
Cramer's V                            0.1513

Sample Size = 200

Table of frequencies

The FREQ Procedure

Table of ses by female^a

ses       female

Frequency^b|
Expected^c|
Percent^d |
Row Pct^e |
Col Pct^f |       0|       1|  Total^g
---------+--------+--------+
       1 |     15 |     32 |     47
         | 21.385 | 25.615 |
         |   7.50 |  16.00 |  23.50
         |  31.91 |  68.09 |
         |  16.48 |  29.36 |
---------+--------+--------+
       2 |     47 |     48 |     95
         | 43.225 | 51.775 |
         |  23.50 |  24.00 |  47.50
         |  49.47 |  50.53 |
         |  51.65 |  44.04 |
---------+--------+--------+
       3 |     29 |     29 |     58
         |  26.39 |  31.61 |
         |  14.50 |  14.50 |  29.00
         |  50.00 |  50.00 |
         |  31.87 |  26.61 |
---------+--------+--------+
Total^g         91      109      200
            45.50    54.50   100.00

a. Table of – This is the title of the table. The first variable listed will be the row variable and the second variable will be the column variable.

b. Frequency – This is the observed cell frequency. It is also called count. For example, there are 15 males (female=0) in the low socioeconomic status group. The observed cell frequencies and the expected cell frequencies are used to test if the row and the column variables are independent.

c. Expected – This is the cell frequency expected under the null hypothesis that the row and column variables are independent. It is computed as (row total * column total) / N; for the first cell, 47*91/200 = 21.385. This number is produced by using the option expected on the tables statement. Comparing the expected cell frequencies with the observed frequencies gives some idea of whether the row variable is independent of the column variable.
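To check the expected counts by hand outside SAS, the calculation can be sketched in a few lines of Python. This is only an illustrative sketch: the variable names are our own, and the observed counts are typed in from the table above rather than read from hsb2.

```python
# Observed counts from the ses-by-female table
# (rows: ses = 1, 2, 3; columns: female = 0, 1).
observed = [[15, 32], [47, 48], [29, 29]]

row_totals = [sum(row) for row in observed]        # 47, 95, 58
col_totals = [sum(col) for col in zip(*observed)]  # 91, 109
n = sum(row_totals)                                # 200

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for ses_level, row in enumerate(expected, start=1):
    print(ses_level, [round(e, 3) for e in row])
```

The printed values should agree with the Expected lines in the SAS table (21.385 and 25.615 for ses=1, and so on).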

d. Percent – This is the percent of the total observations represented by the cell frequency. In the table above, we see that there are 15 males (female=0) in the low socioeconomic status group (ses=1). That represents 7.5% of the total number of observations. You can suppress this output by using the nopercent option on the tables statement.

e. Row Pct – This gives the percent of observations in the row. In the table above, we see that there are 15 males (female=0) and 32 females (female=1) in the low socioeconomic status group. So the row percent for the first cell is 15/47*100=31.91. You can suppress this output by using the option norow on the tables statement.

f. Col Pct – This gives the percent of observations in the column. In the table above, we see that there are 91 males in total, 15 of whom are in the low socioeconomic status group. So the column percent for the first cell is 15/91*100=16.48. You can suppress this output by using the option nocol on the tables statement.
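The three percentages for a cell differ only in the denominator. As a minimal sketch (with variable names of our own choosing and the counts copied from the table above), here they are for the first cell, ses=1 and female=0:

```python
# Observed counts (rows: ses = 1, 2, 3; columns: female = 0, 1).
observed = [[15, 32], [47, 48], [29, 29]]

row_totals = [sum(row) for row in observed]        # 47, 95, 58
col_totals = [sum(col) for col in zip(*observed)]  # 91, 109
n = sum(row_totals)                                # 200

count = observed[0][0]                  # 15 males in the low ses group
percent = 100 * count / n               # share of all 200 observations
row_pct = 100 * count / row_totals[0]   # share of the ses=1 row (47)
col_pct = 100 * count / col_totals[0]   # share of the female=0 column (91)

print(f"{percent:.2f} {row_pct:.2f} {col_pct:.2f}")  # prints "7.50 31.91 16.48"
```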

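g. Total – These are the marginal totals: the number of valid (nonmissing) observations in each row and each column, with the grand total (200 here) in the lower right corner. By default, observations with a missing value on either variable are excluded from the table, so the total number of observations in the data set is the grand total plus the number of observations with missing values. If the sample size is not large enough, tests of independence for contingency tables, such as the chi-square test, may not be accurate.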

Statistics

Statistics for Table of ses by female

Statistic^h                    DF       Value      Prob
------------------------------------------------------
Chi-Square^i                   2      4.5765    0.1014
Likelihood Ratio Chi-Square^j  2      4.6789    0.0964
Mantel-Haenszel Chi-Square^k   1      3.1098    0.0778
Phi Coefficient^l                     0.1513
Contingency Coefficient^m             0.1496
Cramer's V^n                          0.1513

Sample Size = 200

h. Statistic – This part of the output is produced by using the option chisq on the tables statement. It consists of chi-square tests and chi-square-based statistics. The three chi-square tests test the null hypothesis that there is no association between the row variable and the column variable; the three coefficients below them describe the strength of the association. For additional measures of association, you can use the measures option on the tables statement.

i. Chi-square – This is also known as the Pearson chi-square test. It compares the observed frequencies with the expected frequencies across all cells of the table. The degrees of freedom for the chi-square test are (R-1)*(C-1), where R is the number of rows and C the number of columns of the table (in other words, the number of levels of each of the variables). Here that is (3-1)*(2-1) = 2. A large chi-square statistic corresponds to a small p-value. If the p-value is small enough (say < 0.05), then we reject the null hypothesis that the two variables are independent and conclude that there is an association between the row and the column variables. In this example the p-value is 0.1014, so we do not reject the null hypothesis at the 0.05 level.
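As a rough check on the Pearson chi-square, the statistic can be rebuilt from the observed counts in Python. This is a sketch under stated assumptions, not how SAS computes it internally: the variable names are ours, and the p-value uses the closed form exp(-x/2), which holds only for 2 degrees of freedom.

```python
import math

# Observed counts (rows: ses = 1, 2, 3; columns: female = 0, 1).
observed = [[15, 32], [47, 48], [29, 29]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells.
chi_square = 0.0
for i, r in enumerate(row_totals):
    for j, c in enumerate(col_totals):
        expected = r * c / n
        chi_square += (observed[i][j] - expected) ** 2 / expected

df = (len(row_totals) - 1) * (len(col_totals) - 1)  # (3-1)*(2-1) = 2

# Upper-tail probability of a chi-square with df = 2 is exp(-x/2).
# This shortcut is specific to df = 2; other df need a chi-square CDF routine.
p_value = math.exp(-chi_square / 2)

print(f"chi-square = {chi_square:.4f}, df = {df}, p = {p_value:.4f}")
```

The result should match the Chi-Square line of the SAS output (4.5765 with a p-value of 0.1014).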

j. Likelihood Ratio Chi-Square – This involves the ratio between the observed and the expected frequencies, whereas the ordinary chi-square test involves the difference between the two. This method was developed more recently than the chi-square test and is the second most widely used after the chi-square test. It is directly related to log-linear analysis and logistic regression. When the row and column variables are independent, the likelihood-ratio chi-square has an approximate chi-square distribution with (R-1)*(C-1) degrees of freedom, where R is the number of rows and C the number of columns of the table.
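The likelihood ratio statistic is commonly written as 2 * sum(observed * ln(observed / expected)). A minimal Python sketch of that formula, assuming all observed counts are positive (true for this table) and again using the df = 2 shortcut for the p-value, looks like this; the variable names are ours:

```python
import math

# Observed counts (rows: ses = 1, 2, 3; columns: female = 0, 1).
observed = [[15, 32], [47, 48], [29, 29]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# G^2 = 2 * sum over cells of observed * ln(observed / expected).
# Assumes every observed count is > 0, so the logarithm is defined.
g_squared = 0.0
for i, r in enumerate(row_totals):
    for j, c in enumerate(col_totals):
        expected = r * c / n
        g_squared += 2 * observed[i][j] * math.log(observed[i][j] / expected)

# Upper-tail chi-square probability for df = 2 only: exp(-x/2).
p_value = math.exp(-g_squared / 2)

print(f"G^2 = {g_squared:.4f}, p = {p_value:.4f}")
```

The values should agree with the Likelihood Ratio Chi-Square line (4.6789, p = 0.0964) up to rounding.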

k. Mantel-Haenszel Chi-Square – It is also called the Mantel-Haenszel test for linear association. Unlike the ordinary and likelihood ratio chi-square tests, it treats the row and column variables as ordinal and always has 1 degree of freedom. It is defined as (N-1)r², where r is the Pearson correlation between the row variable and the column variable. It is preferred when testing the significance of a linear relationship between two ordinal variables. If the test is significant, we say that increases in one variable are associated with increases (or decreases for negative relationships) in the other variable greater than would be expected by chance. Like other chi-square statistics, the Mantel-Haenszel chi-square should not be used with tables with small cell counts.
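The (N-1)r² definition can be checked from the table alone. The sketch below assumes the scores are simply the variable values (1, 2, 3 for ses and 0, 1 for female) and weights each cell by its count; the names are ours, and the p-value uses the df = 1 identity P(chi-square > x) = erfc(sqrt(x/2)).

```python
import math

# Observed counts (rows: ses = 1, 2, 3; columns: female = 0, 1).
observed = [[15, 32], [47, 48], [29, 29]]
row_scores = [1, 2, 3]  # assumed scores: the ses values themselves
col_scores = [0, 1]     # assumed scores: the female values themselves

# One (x, y, weight) triple per cell of the table.
cells = [(row_scores[i], col_scores[j], observed[i][j])
         for i in range(len(row_scores))
         for j in range(len(col_scores))]
n = sum(w for _, _, w in cells)

mean_x = sum(x * w for x, _, w in cells) / n
mean_y = sum(y * w for _, y, w in cells) / n
sxy = sum(w * (x - mean_x) * (y - mean_y) for x, y, w in cells)
sxx = sum(w * (x - mean_x) ** 2 for x, _, w in cells)
syy = sum(w * (y - mean_y) ** 2 for _, y, w in cells)

r = sxy / math.sqrt(sxx * syy)           # Pearson correlation
mh = (n - 1) * r ** 2                    # Mantel-Haenszel chi-square
p_value = math.erfc(math.sqrt(mh / 2))   # upper tail, df = 1

print(f"r = {r:.4f}, MH chi-square = {mh:.4f}, p = {p_value:.4f}")
```

The correlation comes out slightly negative because the proportion of females is highest in the ses=1 row; the statistic and p-value should match the Mantel-Haenszel line (3.1098, p = 0.0778) up to rounding.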

l. Phi Coefficient – This is a measure of association based on adjusting the chi-square statistic to factor out sample size. Its range is between -1 and 1 for 2-by-2 tables, and between 0 and min(sqrt(R-1), sqrt(C-1)) for larger tables. Computationally, phi is the square root of (chi-square divided by n), where n is the sample size. The phi coefficient is often used as a measure of association in 2-by-2 tables formed by true dichotomies.

m. Contingency Coefficient – The contingency coefficient is an adjustment to the phi coefficient, intended to adapt it to tables larger than 2-by-2. It is computed as the square root of chi-square / (chi-square + n), where n is the sample size. The contingency coefficient is always less than 1 and approaches 1 only for large tables. The larger the contingency coefficient, the stronger the association. Some researchers recommend it only for 5-by-5 tables or larger. For smaller tables it will underestimate the level of association.

n. Cramer’s V – Cramer’s V is the most popular of the chi-square-based measures of nominal association because it is designed so that the attainable upper limit is always 1. Cramer’s V equals the square root of chi-square / (n * m), where n is the sample size and m is the smaller of (rows – 1) and (columns – 1). In this table m = 1, so Cramer’s V is the same as the phi coefficient (0.1513).
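The three chi-square-based coefficients can be reproduced directly from the reported Pearson chi-square and the sample size. The following Python lines are a small sketch with our own variable names, taking chi-square = 4.5765 and n = 200 from the output above:

```python
import math

chi_square = 4.5765  # Pearson chi-square as reported in the SAS output
n = 200              # sample size
rows, cols = 3, 2    # levels of ses and female

phi = math.sqrt(chi_square / n)
contingency = math.sqrt(chi_square / (chi_square + n))
m = min(rows - 1, cols - 1)                  # 1 for a 3-by-2 table
cramers_v = math.sqrt(chi_square / (n * m))

print(f"{phi:.4f} {contingency:.4f} {cramers_v:.4f}")  # prints "0.1513 0.1496 0.1513"
```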