ttest | t-test |
anova | Analysis of variance |
regress | Regression |
predict | Predicts after model estimation |
test | Test linear hypotheses after model estimation |
contrast | Contrasts and linear hypothesis tests after estimation |
margins | Predicted means |
marginsplot | Plot predicted means |
kdensity | Kernel density estimates and graphs |
qnorm | Graphs a quantile plot |
logit | Logistic regression |
tabulate | Crosstabs with chi-square test |
signtest | Tests the equality of matched pairs of data |
signrank | Wilcoxon matched-pairs signed rank test |
ranksum | Mann-Whitney two-sample test |
kwallis | Nonparametric analog to the one-way anova |
2.0 Demonstration and explanation
We will begin by downloading the dataset for this unit over the internet.
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
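Before running any tests, it can help to confirm that the download worked and to look at the variables used in this unit. The following is a minimal sketch that uses only variables referenced later on this page; the choice of which variables to inspect is ours.

```stata
* Load the dataset (same URL as above)
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

* Variable names, storage types, and labels
describe

* Summary statistics for the two continuous scores used below
summarize write read

* Frequency tables for two of the categorical variables used below
tabulate prog
tabulate female
```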
A) Analysis of normally-distributed outcomes
Each of the following tests assumes that the outcome is normally distributed (more accurately, that the residuals are normally distributed).
A1. t-tests
The t-test is usually used to test the equality of 2 sample means, but can also test the equality of a sample mean to some hypothesized population mean.
A one-sample t-test, testing whether the sample of writing scores was drawn from a population with a mean of 50.
ttest write = 50
A paired t-test, testing whether or not the mean of write equals the mean of read.
ttest write = read
A two-sample independent t-test with pooled (equal) variances, testing equality of means of write between males and females.
ttest write, by(female)
This is the two-sample independent t-test but with separate (unequal) variances.
ttest write, by(female) unequal
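One way to choose between the pooled and unequal-variance versions is to compare the group variances first. The sketch below is illustrative rather than a required workflow: it runs a variance-ratio test and a more robust alternative, then shows that ttest leaves its results in r() for later use.

```stata
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

* Variance-ratio (F) test of equal variances of write across female
sdtest write, by(female)

* More robust alternative (Levene's test and Brown-Forsythe variants)
robvar write, by(female)

* Re-run the pooled t-test and display a few of its stored results
ttest write, by(female)
display "t = " r(t) ",  df = " r(df_t) ",  two-sided p = " r(p)
```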
A2. Analysis of Variance
ANOVA is used to test for the equality of means among two or more groups. It is a special case of linear regression, which is more commonly used today.
Here is an example of a one-way analysis of variance, testing the equality of the mean of write among prog groups. The i. specification tells Stata that prog is a categorical variable, which Stata will then convert into dummy variables. Stata then enters all but one of those dummies (by default all but the first) into the model.
anova write i.prog
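To see the connection between ANOVA and regression, the same model can be fit with regress. This is a sketch under the assumption that prog has a level coded 3 (the page later refers to 3.prog); the ib3. prefix simply changes which level serves as the omitted reference group.

```stata
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

* One-way ANOVA, as above
anova write i.prog

* Same model fit as a regression; the overall model F test
* matches the ANOVA F test for prog
regress write i.prog

* Joint test that all prog dummy coefficients are zero
testparm i.prog

* Same model with level 3 of prog as the reference group instead of level 1
regress write ib3.prog
```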
A3. Linear Regression
Linear regression is used to estimate the effects of multiple predictors, which can include both continuous and categorical variables, on a normally-distributed outcome. We use c. to indicate continuous predictors, and i. to indicate categorical predictors.
regress write c.read i.prog
We can specify the interaction of predictors using the # symbol. A single # between variables requests just the interaction, while the specification ## requests both the main effects and the interaction. Below, we request the main effects of read and prog as well as their interaction:
regress write c.read##i.prog
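The difference between # and ## is easiest to see by writing the model both ways. The sketch below fits the shorthand form and the equivalent long form, then jointly tests the interaction terms with testparm.

```stata
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

* Shorthand: main effects plus interaction
regress write c.read##i.prog

* Equivalent long form: main effects listed explicitly,
* with # supplying only the interaction terms
regress write c.read i.prog c.read#i.prog

* Joint test that all read-by-prog interaction coefficients are zero
testparm c.read#i.prog
```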
B) Postestimation – analysis after running the model
Stata has a large suite of commands that can estimate and graph various statistics after a model has been run.
B1. Custom hypothesis testing and contrasts with test and contrast
We may be interested in performing additional tests that are not part of the specified regression model. The test command allows us to test linear combinations of the regression coefficients. For example, we may wish to test whether the coefficients are the same for prog=2 and prog=3.
test 2.prog = 3.prog
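The test command reports only a test statistic and p-value. If an estimate of the difference is also wanted, lincom can supply it. The sketch below refits the additive model from A3 (our choice, so that the prog coefficients are not conditional on an interaction) and shows both commands.

```stata
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
regress write c.read i.prog

* Wald test that the two coefficients are equal
test 2.prog = 3.prog

* Joint test that both coefficients are zero
test 2.prog 3.prog

* Estimate the difference itself, with standard error and confidence interval
lincom 3.prog - 2.prog
```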
The contrast command is powerful and flexible, and can perform several custom contrasts in a single call. Below we run a new model with an interaction of prog and female, and then use contrast to test for the significance of the female effect within each prog, and the significance of prog within each gender:
regress write i.female##i.prog
contrast female@prog
contrast prog@female
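Beyond simple effects, contrast can also report the overall tests of main effects and interaction, and can apply contrast operators. The sketch below is one possible extension, not part of the original demonstration; r. requests comparisons against the reference level.

```stata
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
regress write i.female##i.prog

* Overall tests of the two main effects and their interaction
contrast female##prog

* Each prog level compared with the reference level, within each female group
contrast r.prog@female
```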
B2. Marginal means and effects with margins and marginsplot
The margins command is among Stata’s most flexible and powerful commands, and can estimate marginal (population averaged) means and effects. It is typically used to estimate cell means of an effect (often an interaction), averaged over the other covariates in the population. Additionally, the marginsplot command provides an easy way to plot the results of the margins command. Below, we estimate the marginal means of each cell of the female#prog interaction, and then plot the means.
regress write i.female##i.prog
margins female#prog
marginsplot
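The margins command can estimate effects as well as means. As a sketch, the dydx() option below requests the difference in predicted write between the female groups within each prog, and the xdimension() option of marginsplot controls which factor is placed on the x-axis.

```stata
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
regress write i.female##i.prog

* Difference between the female groups in predicted write, within each prog
margins prog, dydx(female)

* Cell means, then a plot with prog on the x-axis
margins female#prog
marginsplot, xdimension(prog)
```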
B3. Residual analysis with predict and graphing
The predict command can be used to estimate predicted values, influence statistics, and residuals after a model has been estimated. Here we estimate predicted scores on the outcome of the previous linear regression and store them in a variable, pred.
predict pred
The resid option requests residuals, which we store in the variable res.
predict res, resid
Let’s look at the predicted values and residuals in the first 20 observations.
list write pred res in 1/20
We can graph the residuals to check the linear regression normality assumption. We use the kdensity command with the normal option to display a density graph of the residuals with a normal distribution superimposed on the graph.
kdensity res, normal
The qnorm command produces a normal quantile plot. It plots the observed distribution of the variable against a theoretical normal distribution with the same mean and variance. Deviation from a straight line along the diagonal indicates deviation from normality.
qnorm res
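Graphical checks can be supplemented with a formal test and with a plot aimed at the constant-variance assumption. The sketch below is an optional addition; res_chk is a hypothetical variable name chosen so that it does not clash with res.

```stata
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
regress write i.female##i.prog

* res_chk is a hypothetical name, used to avoid clashing with res
predict res_chk, resid

* Shapiro-Wilk test; the null hypothesis is that the residuals are normal
swilk res_chk

* Residual-versus-fitted plot with a reference line at zero;
* a fan shape would suggest non-constant variance
rvfplot, yline(0)
```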
B4. Influence analysis
We can check if any observations seem to be having too much influence on the model using measures such as Cook’s D. We can use the predict command with the cooksd option to request Cook’s D scores. We then create a spike plot of Cook’s D for each ID number to check for overly influential observations.
predict cook, cooksd
twoway spike cook id
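The spike plot shows which observations stand out, but a numeric cutoff can make them easier to list. The sketch below uses 4/n, a common rule of thumb rather than a fixed standard, and cook_chk is a hypothetical variable name chosen so that it does not clash with cook.

```stata
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
regress write i.female##i.prog

* cook_chk is a hypothetical name, used to avoid clashing with cook
predict cook_chk, cooksd

* List observations above the 4/n rule of thumb (an assumed cutoff);
* e(N) holds the estimation sample size from regress
list id write female prog cook_chk if cook_chk > 4/e(N) & !missing(cook_chk)

* Same spike plot, with the cutoff drawn as a horizontal line
local cutoff = 4/e(N)
twoway spike cook_chk id, yline(`cutoff')
```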
C) Analysis of categorical outcomes
Analyzing a categorical outcome with a linear regression model violates at least a few of the assumptions of linear regression (normality of residuals, homoskedasticity). Therefore, other tests and models are used; here we demonstrate the chi-square test and logistic regression.
C1. Chi-square test of independence with tab
The tabulate command will compute the chi-square test of independence and other measures of association with the option all.
tabulate prog ses, all
The chi-square test p-value is less trustworthy if any cell has an expected count less than 5. We can display expected frequencies with the expected option.
tabulate prog ses, all expected
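If some expected counts turn out to be small, Fisher's exact test is a commonly used alternative. The sketch below requests only the Pearson chi-square with expected counts, and then the exact test; whether the exact test is needed for these data depends on the expected counts displayed.

```stata
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

* Pearson chi-square alone, with expected counts shown in each cell
tabulate prog ses, chi2 expected

* Fisher's exact test, which does not rely on the large-sample approximation
tabulate prog ses, exact
```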
C2. Logistic regression with logit
Logistic regression allows estimation of the effect of multiple predictors on a binary outcome. We demonstrate the logistic regression command with the binary outcome honors (representing membership in the honors program).
tab honors
The default output for the logit command is given as coefficients in the log odds metric. To obtain odds ratios, use the or option.
logit honors c.read i.female
logit, or
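Odds ratios can also be requested when the model is first estimated, or by using the logistic command. The sketch below shows three ways of fitting the same model, which differ only in how the results are displayed.

```stata
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

* Coefficients in the log-odds metric
logit honors c.read i.female

* Odds ratios requested at estimation time
logit honors c.read i.female, or

* logistic fits the same model and reports odds ratios by default
logistic honors c.read i.female
```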
The predict and margins commands can be used after logit models, as well as most other regression models. Here we estimate the predicted probabilities for each observation using predict, and the marginal predicted probabilities of honors for each gender using margins.
predict prob, pr
margins female
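Because the logit model is nonlinear, predicted probabilities depend on where the other predictors are held. The sketch below evaluates the probabilities over a grid of read values; the grid of 30 to 70 in steps of 10 is an assumed, illustrative range, not one taken from the data.

```stata
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
logit honors c.read i.female

* Predicted probability of honors for each female group at selected read scores
* (the grid 30 to 70 in steps of 10 is an assumed, illustrative range)
margins female, at(read=(30(10)70))
marginsplot

* Average marginal effects of each predictor on the probability scale
margins, dydx(*)
```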
D) Non-parametric Tests
Non-parametric tests do not assume that the outcome follows a particular distribution (such as the normal), so they are useful when the generating distribution is unknown. However, the tests shown here can accommodate no more than one predictor at a time.
The signtest command computes the sign test, the nonparametric analog of the single-sample t-test.
signtest write = 50
The signrank command computes a Wilcoxon signed-rank test, the nonparametric analog of the paired t-test.
signrank write = read
The ranksum command computes the nonparametric analog of the independent two-sample t-test, which is known as the Mann-Whitney or Wilcoxon rank-sum test.
ranksum write, by(female)
The kwallis command computes a Kruskal-Wallis test, the non-parametric analog of the one-way ANOVA.
kwallis write, by(prog)
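These tests report a test statistic and p-value but little about the size of the difference. As an optional sketch, the porder option of ranksum adds an estimate of the probability that a score from one group exceeds a score from the other, and tabstat can show group medians alongside the Kruskal-Wallis result.

```stata
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

* Rank-sum test plus the estimated probability that write in the
* first group is larger than write in the second group
ranksum write, by(female) porder

* Group medians, interquartile ranges, and counts to accompany kwallis
tabstat write, by(prog) statistics(median iqr n)
kwallis write, by(prog)
```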
Most of the postestimation commands like predict, contrast and margins are not available after nonparametric tests.
3.0 For more information
- Statistics with Stata 12
- Chapters 4, 7-13
- Gentle Introduction to Stata, Revised Third Edition
- Chapters 6-11
- Data Analysis Using Stata, Third Edition
- Chapters 8-10
- An Introduction to Stata for Health Researchers, Third Edition
- Chapters 11-15
- Interpreting and Visualizing Regression Models Using Stata
- Stata Web Books
- Regression with Stata Webbook
Includes such topics as diagnostics, categorical predictors, testing interactions and testing contrasts
- Choosing the Correct Statistical Test
Includes guidelines for choosing the correct non-parametric test
- Data Analysis Examples
Gives examples of common analysis and interpretation of the output
- Annotated Output
Fully annotates the output from common statistical procedures
- Frequently Asked Questions
Covers many topics, including ANOVA, linear regression, logistic regression and use of the margins command