Version info: Code for this page was tested in Stata 12.
This module gives a brief overview of some common statistical tests in Stata. We will use the auto data file for our examples.
sysuse auto
t-tests
Let’s do a t-test comparing the miles per gallon (mpg) of foreign and domestic cars.
ttest mpg , by(foreign)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      52    19.82692     .657777    4.743297    18.50638    21.14747
       1 |      22    24.77273     1.40951    6.611187    21.84149    27.70396
---------+--------------------------------------------------------------------
combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
---------+--------------------------------------------------------------------
    diff |           -4.945804    1.362162               -7.661225   -2.230384
------------------------------------------------------------------------------
Degrees of freedom: 72

                  Ho: mean(0) - mean(1) = diff = 0

     Ha: diff < 0               Ha: diff ~= 0              Ha: diff > 0
       t =  -3.6308               t =  -3.6308              t =  -3.6308
   P < t =   0.0003         P > |t| =   0.0005          P > t =   0.9997
As you see in the output above, the domestic cars (foreign = 0) had significantly lower mpg (19.8) than the foreign cars (24.8); the two-sided p-value is 0.0005.
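The output above assumes equal variances, yet the two groups' standard deviations differ somewhat (4.74 versus 6.61). As a minimal sketch, assuming the auto data is still in memory, the unequal option of ttest requests a test that does not assume equal variances. Whether it is needed here is a judgment call.

```stata
* Same comparison, without assuming equal variances in the two groups
* (assumes the auto data loaded earlier with sysuse is still in memory)
ttest mpg, by(foreign) unequal
```

The group means are unchanged; only the standard error of the difference and the degrees of freedom are computed differently.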
Chi-square
Let’s compare the repair rating (rep78) of the foreign and domestic cars. We can make a crosstab of rep78 by foreign. We may want to ask whether these variables are independent. We can use the chi2 option to request a chi-square test of independence as well as the crosstab.
tabulate rep78 foreign, chi2

           |       foreign
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         1 |         2          0 |         2
         2 |         8          0 |         8
         3 |        27          3 |        30
         4 |         9          9 |        18
         5 |         2          9 |        11
-----------+----------------------+----------
     Total |        48         21 |        69

          Pearson chi2(4) =  27.2640   Pr = 0.000
The chi-square test may not be valid when the table has empty cells or cells with small expected frequencies. In such cases, you can request Fisher’s exact test with the exact option.
tabulate rep78 foreign, chi2 exact

           |       foreign
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         1 |         2          0 |         2
         2 |         8          0 |         8
         3 |        27          3 |        30
         4 |         9          9 |        18
         5 |         2          9 |        11
-----------+----------------------+----------
     Total |        48         21 |        69

          Pearson chi2(4) =  27.2640   Pr = 0.000
           Fisher's exact =                 0.000
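To judge how sparse the table is, it can help to look at the expected frequencies alongside the observed counts. A minimal sketch, assuming the auto data is still in memory, uses the expected option of tabulate:

```stata
* Show the expected frequency under independence beneath each observed count
* (assumes the auto data is still in memory)
tabulate rep78 foreign, chi2 expected
```

Cells with small expected frequencies are the ones that make the chi-square approximation questionable.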
Correlation
We can use the correlate command to get the correlations among variables. Let’s look at the correlations among price, mpg, weight, and rep78. (We include rep78 even though it is not continuous, to illustrate what happens when you use correlate with variables that have missing data.)
correlate price mpg weight rep78

(obs=69)

         |    price      mpg   weight    rep78
---------+------------------------------------
   price |   1.0000
     mpg |  -0.4559   1.0000
  weight |   0.5478  -0.8055   1.0000
   rep78 |   0.0066   0.4023  -0.4003   1.0000
Note that the output above said (obs=69). The correlate command drops data on a listwise basis, meaning that if any of the variables are missing, then the entire observation is omitted from the correlation analysis.
We can use pwcorr (pairwise correlations) if we want correlations that delete missing data on a pairwise basis instead of a listwise basis. We will use the obs option to show the number of observations used for calculating each correlation.
pwcorr price mpg weight rep78, obs
          |    price      mpg   weight    rep78
----------+------------------------------------
    price |   1.0000
          |       74
          |
      mpg |  -0.4686   1.0000
          |       74       74
          |
   weight |   0.5386  -0.8072   1.0000
          |       74       74       74
          |
    rep78 |   0.0066   0.4023  -0.4003   1.0000
          |       69       69       69       69
          |
Note how the correlations that involve rep78 have an N of 69 compared to the other correlations that have an N of 74. This is because rep78 has five missing values, so it only had 69 valid observations, but the other variables had no missing data so they had 74 valid observations.
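The missing-value count can be checked directly. A minimal sketch, assuming the full auto data (all 74 cars) is still in memory:

```stata
* Count the observations where rep78 is missing
* (assumes the full auto data, 74 observations, is in memory)
count if missing(rep78)
```

Given the Ns shown above (74 versus 69), this should report 5.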
Regression
Let’s look at doing regression analysis in Stata. For this example, let’s drop the cases where rep78 is 1 or 2 or missing.
drop if (rep78 <= 2) | (rep78==.)
(15 observations deleted)
Now, let’s predict mpg from price and weight. As you see below, weight is a significant predictor of mpg, but price is not.
regress mpg price weight

  Source |       SS       df       MS                  Number of obs =      59
---------+------------------------------               F(  2,    56) =   47.87
   Model |  1375.62097     2  687.810483               Prob > F      =  0.0000
Residual |  804.616322    56  14.3681486               R-squared     =  0.6310
---------+------------------------------               Adj R-squared =  0.6178
   Total |  2180.23729    58  37.5902981               Root MSE      =  3.7905

------------------------------------------------------------------------------
     mpg |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   price |  -.0000139   .0002108     -0.066   0.948      -.0004362    .0004084
  weight |   -.005828   .0007301     -7.982   0.000      -.0072906   -.0043654
   _cons |   39.08279   1.855011     21.069   0.000       35.36676    42.79882
------------------------------------------------------------------------------
What if we wanted to predict mpg from rep78 as well? rep78 is really more of a categorical variable than a continuous variable. To include it in the regression, we should convert rep78 into dummy variables. Fortunately, Stata makes dummy variables easily using tabulate. The gen(rep) option tells Stata that we want to generate dummy variables from rep78 and we want the stem of the dummy variables to be rep.
tabulate rep78, gen(rep)

      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |         30       50.85       50.85
          4 |         18       30.51       81.36
          5 |         11       18.64      100.00
------------+-----------------------------------
      Total |         59      100.00
Stata has created rep1 (1 if rep78 is 3), rep2 (1 if rep78 is 4) and rep3 (1 if rep78 is 5). We can use the tabulate command to verify that the dummy variables were created properly.
tabulate rep78 rep1

           |   rep78== 3.0000
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         3 |         0         30 |        30
         4 |        18          0 |        18
         5 |        11          0 |        11
-----------+----------------------+----------
     Total |        29         30 |        59

tabulate rep78 rep2

           |   rep78== 4.0000
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         3 |        30          0 |        30
         4 |         0         18 |        18
         5 |        11          0 |        11
-----------+----------------------+----------
     Total |        41         18 |        59

tabulate rep78 rep3

           |   rep78== 5.0000
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         3 |        30          0 |        30
         4 |        18          0 |        18
         5 |         0         11 |        11
-----------+----------------------+----------
     Total |        48         11 |        59
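Now we can include rep1 and rep2 as dummy variables in the regression model. The omitted dummy, rep3 (rep78 of 5), serves as the reference group, so each dummy coefficient compares that repair group with the cars rated 5.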
regress mpg price weight rep1 rep2

      Source |       SS       df       MS              Number of obs =      59
-------------+------------------------------           F(  4,    54) =   26.04
       Model |  1435.91975     4  358.979938           Prob > F      =  0.0000
    Residual |  744.317536    54  13.7836581           R-squared     =  0.6586
-------------+------------------------------           Adj R-squared =  0.6333
       Total |  2180.23729    58  37.5902981           Root MSE      =  3.7126

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       price |  -.0001126   .0002133    -0.53   0.600    -.0005403    .0003151
      weight |   -.005107   .0008236    -6.20   0.000    -.0067584   -.0034557
        rep1 |  -2.886288   1.504639    -1.92   0.060    -5.902908    .1303314
        rep2 |   -2.88417   1.484817    -1.94   0.057    -5.861048    .0927086
       _cons |   39.89189   1.892188    21.08   0.000     36.09828     43.6855
------------------------------------------------------------------------------
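In Stata 11 and later, factor-variable notation offers an alternative to building the dummies by hand. The sketch below assumes the data in memory is the reduced auto file (59 cars with rep78 of 3, 4, or 5). The ib5. prefix makes rep78 of 5 the base category, so the model should be equivalent to the one with rep1 and rep2, and testparm then tests the rep78 indicators jointly.

```stata
* Same model with factor-variable notation; ib5. sets rep78 == 5 as the base
* (assumes the cars with rep78 of 1, 2, or missing were already dropped)
regress mpg price weight ib5.rep78

* Joint test that all rep78 indicator coefficients are zero
testparm i.rep78
```

The joint test is useful because the individual t-tests each compare only one repair group with the base category.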
Analysis of variance
If you wanted to do an analysis of variance looking at the differences in mpg among the three repair groups, you can use the oneway command to do this.
oneway mpg rep78

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      506.325167      2   253.162583      8.47     0.0006
 Within groups      1673.91212     56   29.8912879
------------------------------------------------------------------------
    Total           2180.23729     58   37.5902981

Bartlett's test for equal variances:  chi2(2) =   9.9384  Prob>chi2 = 0.007
If you include the tabulate option, you get mean mpg for the three groups, which shows that the group with the best repair rating (rep78 of 5) also has the highest mpg (27.4).
oneway mpg rep78, tabulate

            |          Summary of mpg
      rep78 |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          3 |   19.433333   4.1413252          30
          4 |   21.666667   4.9348699          18
          5 |   27.363636   8.7323849          11
------------+------------------------------------
      Total |    21.59322   6.1310927          59

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      506.325167      2   253.162583      8.47     0.0006
 Within groups      1673.91212     56   29.8912879
------------------------------------------------------------------------
    Total           2180.23729     58   37.5902981

Bartlett's test for equal variances:  chi2(2) =   9.9384  Prob>chi2 = 0.007
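The overall F test says the group means differ but not which pairs differ. As a minimal sketch, assuming the reduced data (59 cars) is still in memory, the bonferroni option of oneway adds pairwise comparisons with Bonferroni-adjusted p-values:

```stata
* One-way ANOVA plus Bonferroni-adjusted pairwise comparisons of group means
* (assumes the cars with rep78 of 1, 2, or missing were already dropped)
oneway mpg rep78, bonferroni
```

Bartlett's test above suggests the group variances are unequal, so these comparisons should be read with some caution.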
If you want to include covariates, you need to use the anova command. The c. prefix on price and weight tells Stata to treat those variables as continuous covariates. (Versions of Stata before 11 used the continuous(price weight) option for this.)
anova mpg rep78 c.price c.weight

                       Number of obs =      59     R-squared     =  0.6586
                       Root MSE      = 3.71263     Adj R-squared =  0.6333

              Source |  Partial SS    df       MS           F     Prob > F
          -----------+----------------------------------------------------
               Model |  1435.91975     4  358.979938      26.04     0.0000
                     |
               rep78 |  60.2987853     2  30.1493926       2.19     0.1221
               price |   3.8421233     1   3.8421233       0.28     0.5997
              weight |  529.932889     1  529.932889      38.45     0.0000
                     |
            Residual |  744.317536    54  13.7836581
          -----------+----------------------------------------------------
               Total |  2180.23729    58  37.5902981