Stata Class Notes: Analyzing Data

ttest	t-test
anova	Analysis of variance
regress	Regression
predict	Predicts after model estimation
test	Test linear hypotheses after model estimation
contrast	Contrasts and linear hypothesis tests after estimation
margins	Predicted means
marginsplot	Plot predicted means
kdensity	Kernel density estimates and graphs
qnorm	Graphs a quantile plot
logit	Logistic regression
tabulate	Crosstabs with chi-square test
signtest	Tests the equality of matched pairs of data
signrank	Wilcoxon matched-pairs signed rank test
ranksum	Mann-Whitney two-sample test
kwallis	Nonparametric analog to the one-way anova

2.0 Demonstration and explanation

We will begin by downloading the dataset for this unit over the internet.

use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
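
Before running any tests, it can help to confirm that the download worked and to look at the variables used in this unit. The commands below are a minimal sketch, assuming the dataset loaded above is in memory and contains the variables referenced later in these notes.

```stata
* Storage types and labels of the variables used in this unit
describe write read female prog ses honors

* Summary statistics for the two continuous test scores
summarize write read

* Frequency tables for the grouping variables
tabulate female
tabulate prog
```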

A) Analysis of normally-distributed outcomes

Each of the following tests assumes that the outcome is normally distributed (more accurately, that the residuals are normally distributed).

A1. t-tests

The t-test is usually used to test the equality of two sample means, but it can also test whether a sample mean equals some hypothesized population mean.

A one-sample t-test, testing whether the sample of writing scores was drawn from a population with a mean of 50.

ttest write = 50

A paired t-test, testing whether or not the mean of write equals the mean of read.

ttest write = read
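
The paired t-test is the same as a one-sample t-test on the within-person differences. A short sketch of that equivalence follows; diff_wr is a hypothetical variable name introduced only for this illustration.

```stata
* diff_wr is a hypothetical name: the difference between the two scores
generate diff_wr = write - read

* One-sample t-test of the differences against zero;
* the t statistic should match that of: ttest write = read
ttest diff_wr = 0
```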

A two-sample independent t-test with pooled (equal) variances, testing equality of means of write between males and females.

ttest write, by(female)

This is the two-sample independent t-test but with separate (unequal) variances.

ttest write, by(female) unequal
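
One way to help decide between the pooled and unequal versions is to compare the group variances first. The sketch below assumes the same dataset is in memory; sdtest performs a variance-ratio test, and robvar reports versions that are less sensitive to non-normality.

```stata
* Variance-ratio (F) test of equal standard deviations of write across female
sdtest write, by(female)

* Levene-type robust tests of equal variances
robvar write, by(female)
```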

A2. Analysis of Variance

ANOVA is used to test for the equality of means among two or more groups. It is equivalent to a linear regression on group indicator variables, and linear regression is more commonly used today.

Here is an example of a one-way analysis of variance, testing the equality of the mean of write among prog groups. The i. specification tells Stata that prog is a categorical variable, which Stata will then convert into dummy variables. Stata then enters all but one of those dummies (by default all but the first) into the model.

anova write i.prog
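
To see the equivalence with regression, we can fit the same model with regress and test the prog dummies jointly. This is a sketch under the assumption that the anova above was run on the full dataset.

```stata
* Same model fit by regression; the first prog category is the reference
regress write i.prog

* Joint test that all prog coefficients are zero;
* the F statistic should match the prog row of the anova table
testparm i.prog
```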

A3. Linear Regression

Linear regression is used to estimate the effects of multiple predictors, which can include both continuous and categorical variables, on a normally-distributed outcome. We use c. to indicate continuous predictors, and i. to indicate categorical predictors.

regress write c.read i.prog

We can specify the interaction of predictors using the # symbol. A single # between variables requests just the interaction, while the specification ## requests both the main effects and the interaction. Below, we request the main effects of read and prog as well as their interaction:

regress write c.read##i.prog
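
The ## shorthand can also be written out in full with single # terms. The sketch below fits the same model as the command above and then tests the interaction terms jointly.

```stata
* c.read##i.prog written out: both main effects plus the interaction
regress write c.read i.prog c.read#i.prog

* Joint test of the interaction terms (do the read slopes differ by prog?)
testparm c.read#i.prog
```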

B) Postestimation – analysis after running the model

Stata has a large suite of commands that can estimate and graph various statistics after a model has been run.

B1. Custom hypothesis testing and contrasts with test and contrast

We may be interested in performing additional tests that are not part of the specified regression model. The test command allows us to test linear combinations of the regression coefficients. For example, we may wish to test whether the coefficients are the same for prog=2 and prog=3.

test 2.prog = 3.prog
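
If we also want the size of the difference, lincom reports the estimate with its standard error and confidence interval. This sketch assumes the most recent model is regress write c.read##i.prog; because that model includes an interaction with read, the prog coefficients describe differences at read = 0.

```stata
* Point estimate, standard error and confidence interval for the difference
lincom 2.prog - 3.prog

* Joint test that both prog coefficients are zero
test 2.prog 3.prog
```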

The contrast command is powerful and flexible, and can perform several custom contrasts at once. Below we run a new model with an interaction of prog and female, and then use contrast to test for the significance of the female effect within each prog, and the significance of prog within each gender:

regress write i.female##i.prog
contrast female@prog
contrast prog@female
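
When the question is which specific groups differ, pairwise comparisons are one option. The sketch below assumes the regress write i.female##i.prog model above is the most recent estimation; the Bonferroni adjustment is one choice among several.

```stata
* All pairwise differences among the marginal means of prog,
* with Bonferroni-adjusted p-values and confidence intervals
pwcompare prog, effects mcompare(bonferroni)
```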

B2. Marginal means and effects with margins and marginsplot

The margins command is among Stata’s most flexible and powerful commands. It can estimate marginal (population-averaged) means and effects, and is typically used to estimate cell means of an effect (often an interaction), averaged over the other covariates in the population. Additionally, the marginsplot command provides an easy way to plot the results of the margins command. Below, we estimate the marginal means of each cell of the female#prog interaction, and then plot the means.

regress write i.female##i.prog
margins female#prog
marginsplot
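
margins can also estimate effects rather than means. The sketch below, which assumes the same model is still the active estimation, estimates the female effect within each prog and then redraws the cell means with prog on the x-axis.

```stata
* Difference in predicted write between females and males within each prog
margins prog, dydx(female)

* Re-estimate the cell means and plot them with prog on the x-axis
margins female#prog
marginsplot, xdimension(prog)
```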

B3. Residual analysis with predict and graphing

The predict command can be used to obtain predicted values, influence statistics, and residuals after a model has been estimated. Here we compute predicted scores on the outcome from the previous linear regression and store them in a variable, pred.

predict pred

The resid option requests residuals, which we store in the variable res.

predict res, resid

Let’s look at the predicted values and residuals in the first 20 observations.

list write pred res in 1/20
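
Each residual is the observed score minus the predicted score. A quick sketch to confirm this, using the pred and res variables created above; gap is a hypothetical variable name introduced only for this check.

```stata
* gap is a hypothetical name: distance between res and (write - pred)
generate double gap = abs(res - (write - pred))

* The maximum should be zero up to rounding error
summarize gap
```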

We can graph the residuals to check the linear regression normality assumption. We use the kdensity command with the normal option to display a density graph of the residuals with a normal distribution superimposed on the graph.

kdensity res, normal

The qnorm command produces a normal quantile plot. It plots the observed distribution of the variable against a theoretical normal distribution with the same mean and variance. Deviation from a straight line along the diagonal indicates deviation from normality.

qnorm res
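
The graphs can be supplemented with formal tests of normality. This is a sketch assuming the res variable created above; with large samples these tests can flag small departures that matter little in practice, so we would read them alongside the plots.

```stata
* Shapiro-Wilk test; the null hypothesis is that res is normally distributed
swilk res

* Skewness and kurtosis test for normality
sktest res
```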

B4. Influence analysis

We can check if any observations seem to be having too much influence on the model using measures such as Cook’s D. We can use the predict command with the cooksd option to request Cook’s D scores. We then create a spike plot of Cook’s D for each ID number to check for overly influential observations.

predict cook, cooksd

twoway spike cook id
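
To follow up on the plot, we can list the observations with the largest values. The sketch below uses 4/n as the cutoff, which is a common rule of thumb rather than a formal test, and assumes the regression results are still in memory so that e(N) holds the sample size.

```stata
* 4/n is a rule-of-thumb cutoff; e(N) is the estimation sample size
list id write cook if cook > 4/e(N) & !missing(cook)
```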

C) Analysis of categorical outcomes

Analyzing categorical outcomes in a linear regression model violates at least a few of the assumptions of linear regression (normality of residuals, homoskedasticity). Therefore, other tests and models are used; here we demonstrate the chi-square test and logistic regression.

C1. Chi-square test of independence with tab

The tabulate command will compute the chi-square test of independence and other measures of association with the option all.

tabulate prog ses, all

The chi-square test p-value is less trustworthy if any cell has an expected count less than 5. We can display expected frequencies with the expected option.

tabulate prog ses, all expected
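
If some expected counts are small, Fisher's exact test is a common alternative. A minimal sketch with the same two variables; the exact computation can be slow for large tables.

```stata
* Fisher's exact test alongside the Pearson chi-square test
tabulate prog ses, chi2 exact
```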

C2. Logistic regression with logit

Logistic regression allows estimation of the effects of multiple predictors on a binary outcome. We demonstrate the logistic regression command with the binary outcome honors (representing membership in the honors program).

tab honors

The default output for the logit command is given as coefficients in the log odds metric. To obtain odds ratios, use the or option. Typing logit, or without a variable list redisplays the most recent logit model with odds ratios.

logit honors c.read i.female
logit, or
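
An odds ratio is the exponentiated log-odds coefficient. The sketch below computes two of them by hand from the stored coefficients, assuming the logit model above is the most recent estimation.

```stata
* Odds ratio for a one-point increase in read
display exp(_b[read])

* Odds ratio for female = 1 relative to female = 0
display exp(_b[1.female])
```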

The predict and margins commands can be used after logit models, as well as most other regression models. Here we estimate the predicted probabilities for each observation using predict, and the marginal predicted probabilities of honors for each gender using margins.

predict prob, pr
margins female
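
Predicted probabilities can also be estimated at chosen values of a continuous predictor. In the sketch below the read values 40, 50 and 60 are illustrative choices, not values taken from these notes.

```stata
* Predicted probability of honors for each gender at three read scores
margins female, at(read=(40 50 60))

* Plot the predicted probabilities with confidence intervals
marginsplot
```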

D) Non-parametric Tests

Non-parametric tests make few or no assumptions about the distribution of the outcome, so they are useful when the generating distribution is unknown. However, no more than one predictor can be modeled at once.

The signtest command performs the sign test, the nonparametric analog of the one-sample t-test.

signtest write = 50

The signrank command computes a Wilcoxon signed-rank test, the nonparametric analog of the paired t-test.

signrank write = read

The ranksum command performs the Wilcoxon rank-sum test, also known as the Mann-Whitney test, which is the nonparametric analog of the independent two-sample t-test.

ranksum write, by(female)
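
Because the rank-sum test does not report a difference in means, an effect size on the probability scale can be easier to interpret. A sketch using the porder option of ranksum:

```stata
* Adds an estimate of the probability that write is larger
* in the first group than in the second group
ranksum write, by(female) porder
```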

The kwallis command computes a Kruskal-Wallis test, the non-parametric analog of the one-way ANOVA.

kwallis write, by(prog)

Most of the postestimation commands like predict, contrast and margins are not available after nonparametric tests.

3.0 For more information

Statistics with Stata 12
- Chapters 4, 7-13
Gentle Introduction to Stata, Revised Third Edition
- Chapters 6-11
Data Analysis Using Stata, Third Edition
- Chapters 8-10
An Introduction to Stata for Health Researchers, Third Edition
- Chapters 11-15
Interpreting and Visualizing Regression Models Using Stata
Stata Web Books
- Regression with Stata Webbook: includes such topics as diagnostics, categorical predictors, testing interactions and testing contrasts
Choosing the Correct Statistical Test: includes guidelines for choosing the correct non-parametric test
Data Analysis Examples: gives examples of common analyses and interpretation of the output
Annotated Output: fully annotates the output from common statistical procedures
Frequently Asked Questions: covers many topics, including ANOVA, linear regression, logistic regression and use of the margins command