Stata Class Notes: Exploring Data

1.0 Stata commands in this unit

use	Load dataset into memory
describe	Describe a dataset
list	List the contents of a dataset
codebook	Detailed contents of a dataset
labelbook	Information on value labels
log	Create a log file
lookfor	Find variables in large dataset
summarize	Descriptive statistics
tabstat	Table of descriptive statistics
graph	High resolution graphs
kdensity	Kernel density plot
sort	Sort observations in a dataset
histogram	Histogram for continuous and categorical variables
tabulate	One- and two-way frequency tables
correlate	Correlations
pwcorr	Pairwise correlations
view	Display file in viewer window

2.0 Demonstration and explanation

In this section, we demonstrate commands that give you a quick look at your data. We begin by loading hs0.dta with the use command.

use https://stats.idre.ucla.edu/stat/data/hs0, clear

A) Keep a record of your work with log

Next, we open a log file, which saves all of the commands and their output (except for graphs) in a text file. We use the text option so that the log can be read in any text editor, such as Notepad or WordPad. The replace option overwrites any existing file of the same name.

log using unit2.txt, text replace
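If these commands are rerun from a do-file, log using stops with an error when a log is already open. A common idiom, sketched here with the same file name as above, is to close any open log first; capture suppresses the error when no log is open.

```stata
capture log close                     /* close any open log; no error if none is open */
log using unit2.txt, text replace     /* then open the log as before */
```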

B) Use list and browse to display your data

The list command prints your data to the Results window, while browse opens the Data Editor in read-only browse mode. Both commands accept variable names, as well as if and in qualifiers, to restrict the display to a subset of the data.

list
list gender-read in 1/20
browse
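As a brief sketch of restricting the display, the lines below combine variable lists with if and in. They assume hs0 is still in memory and use only variables that appear elsewhere in these notes.

```stata
list read write if read >= 60             /* only observations that meet a condition */
list gender read write in 1/10, clean     /* first 10 observations, without table borders */
browse read write if read >= 60           /* same restriction in the Data Editor */
```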

C) Use describe and codebook to characterize your variables, and labelbook to characterize your value labels

The describe command shows how each variable is stored in Stata, while codebook reports more detail, including the variable type, range, most frequent values, and number of missing values. Here we also use lookfor to find all variables whose name or variable label contains an “s”.

describe
codebook
lookfor s

The labelbook command describes all value labels used in the data.

labelbook
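With many variables, the full output of these commands gets long. The sketch below, which assumes hs0 is still in memory, limits the output to a few variables and uses the compact option of codebook for a one-line-per-variable overview.

```stata
describe read write prgtype     /* storage type and labels for selected variables */
codebook ses prgtype            /* detailed report for selected variables only */
codebook, compact               /* one line per variable for the whole dataset */
lookfor read                    /* variable names or labels containing "read" */
```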

D) Calculate descriptive statistics of continuous variables with summarize

The basic descriptive statistics command in Stata is summarize, which reports the number of observations, mean, standard deviation, minimum, and maximum. More statistics, such as percentiles, variance, skewness, and kurtosis, are available with the detail option.

summarize                   
summarize read math science write       /* summarize just these variables */ 
display 9.48^2                          /* variance is the sd (9.48) squared */
summarize write, detail                 /* more stats */
sum write if read >=60                  /* sum is abbreviation of summarize */
sum write if prgtype=="academic"
sum write in 1/40
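Rather than retyping a rounded standard deviation, you can use the results that summarize stores in r(). This is a minimal sketch, assuming hs0 is still in memory.

```stata
quietly summarize write     /* run summarize without printing its output */
return list                 /* show every result stored in r() */
display r(sd)^2             /* variance computed as the sd squared */
display r(Var)              /* variance as stored by summarize */
```

The last two lines should display the same value, up to rounding in the displayed digits.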

E) Graphs for exploring continuous variables

Histograms, density plots and boxplots, created by histogram, kdensity and graph box respectively, illustrate the distribution of variables.

histogram write, normal
histogram write, normal start(30) width(5) /* wider bins for a smoother plot */ 
kdensity write, normal
kdensity write, normal bwidth(5)           /* a wider bandwidth for a smoother kdensity plot */
graph box write
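Because the log file does not capture graphs, you may want to save a graph to a file yourself. The sketch below uses graph export, which writes the most recently drawn graph. The file name write_hist.png is a hypothetical choice, and the PNG format may not be available in every Stata installation (for example, console versions).

```stata
histogram write, normal
graph export write_hist.png, replace    /* hypothetical file name; format is taken from the extension */
```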

F) Exploring continuous variables by group

The tabstat command can calculate descriptive statistics within groups.

tabstat read write math, by(prgtype) stat(n mean sd)
tabstat write, by(prgtype) stat(n mean sd p25 p50 p75)
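An alternative to tabstat is the by prefix, which repeats a command within each group. The by prefix requires the data to be sorted by the grouping variable, which is where the sort command comes in; bysort does both steps at once. A short sketch, assuming hs0 is still in memory:

```stata
sort prgtype                                /* order observations by program type */
by prgtype: summarize write                 /* summarize write within each group */
bysort prgtype: summarize write, detail     /* sort and run by group in one step */
```

Note that sort changes the order of the observations, so later commands that use in ranges will refer to different observations than before.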

The histogram and boxplot graphed above can both be produced separately by group, using the by() option for histogram and the over() option for graph box.

histogram math, normal by(prgtype)         /* histograms by prgtype */
graph box write, over(prgtype)             /* box plots by prgtype */

G) Create frequency tables of categorical variables using tabulate

The tab (short for tabulate) command can produce one-way or two-way frequency tables. The tab1 command is a convenience command to produce multiple one-way frequency tables.

tabulate ses
tab1 gender schtyp prgtype

The tab command followed by two variables produces a two-way crosstabulation. We can add the row and col options to get row and column percentages.

tab prgtype ses
tab prgtype ses, row col
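A few other tabulate options are often handy when exploring. The sketch below, assuming hs0 is still in memory, shows percentages without the raw counts, includes missing values as their own category, and orders a one-way table by frequency.

```stata
tab prgtype ses, row nofreq     /* row percentages only, no frequencies */
tab ses, missing                /* count missing values as a category */
tab ses, sort                   /* one-way table in descending order of frequency */
```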

H) Exploring relationships between continuous variables

Correlation matrices describe the pairwise correlations among a set of variables, usually continuous variables. Two commands create correlation matrices: correlate, which uses listwise deletion of missing data (only observations complete on all listed variables are used), and pwcorr, which uses pairwise deletion (each correlation uses all observations complete on that pair).

correlate write read science
pwcorr write read science, obs
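Both commands take further options. As a sketch, assuming hs0 is still in memory, correlate can report covariances instead of correlations, and pwcorr can print a significance level under each correlation alongside the number of observations used for that pair.

```stata
correlate write read science, covariance    /* covariance matrix instead of correlations */
pwcorr write read science, obs sig          /* add p-values and pairwise sample sizes */
```

If the obs counts differ across pairs, the variables have different patterns of missing data, and the correlate and pwcorr results will not match exactly.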

A scatter plot gives a quick graphical look at the relationship between two variables. We use the general-purpose graphing command twoway, which can produce plots of many types (scatter, line, bar, area, etc.), to create simple scatter plots. The graph matrix command produces a matrix of pairwise scatter plots among all variables listed. The jitter option adds a small amount of random noise to the points so that identical observations do not plot on top of each other.

twoway (scatter write read)
twoway (scatter write read, jitter(2))
graph matrix read science write, half
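Because twoway can overlay several plots in one graph, a fitted line is easy to add to a scatter plot. This is a brief sketch, assuming hs0 is still in memory; lfit draws the linear prediction of the first variable from the second.

```stata
twoway (scatter write read) (lfit write read)                   /* scatter with a linear fit line */
twoway (scatter write read) (lfit write read), by(prgtype)      /* same plot, one panel per group */
```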

I) Closing and examining the log file

We have completed all of the analyses in this unit, so it is time to close the log file.

log close

Now, let’s see what is in our log file.

view unit2.txt

3.0 For more information

Statistics with Stata 12, Chapters 3 and 5
Gentle Introduction to Stata, Revised Third Edition, Chapter 5
Data Analysis Using Stata, Third Edition, Chapter 7
An Introduction to Stata for Health Researchers, Third Edition, Chapter 11
Stata Learning Modules: Descriptive information and statistics; Using if with Stata commands