use | Load dataset into memory |
describe | Describe a dataset |
list | List the contents of a dataset |
codebook | Detailed contents of a dataset |
labelbook | Information on value labels |
log | Create a log file |
lookfor | Find variables in large dataset |
summarize | Descriptive statistics |
tabstat | Table of descriptive statistics |
graph | High resolution graphs |
kdensity | Kernel density plot |
sort | Sort observations in a dataset |
histogram | Histogram for continuous and categorical variables |
tabulate | One- and two-way frequency tables |
correlate | Correlations |
pwcorr | Pairwise correlations |
view | Display file in viewer window |
2.0 Demonstration and explanation
In this section, we demonstrate commands that let you take a quick look at your data as you explore it. We begin by loading hs0.dta with the use command.
use https://stats.idre.ucla.edu/stat/data/hs0, clear
A) Keep a record of your work with log
Next, we open a log file, which saves all of the commands and their output (except for graphs) in a text file. We use the text option so that the log can be read in any text editor, such as Notepad or WordPad, and the replace option so that an existing file of the same name is overwritten.
log using unit2.txt, text replace
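If a log is already open from an earlier run, log using stops with an error. A common way around this, sketched here, is to put capture log close first; capture suppresses the error that log close would give when no log is open. The log query line is an optional check and is not part of the original example.

```stata
* Close any log left open by an earlier run (no error if none is open)
capture log close
* Open the log as plain text, overwriting any existing unit2.txt
log using unit2.txt, text replace
* Report the name and status of the open log
log query
```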
B) Use list and browse to display your data
The list command prints your data to the Results window, while browse opens the Data Editor in browse (read-only) mode. You can supply variable names and observation numbers to both list and browse to restrict the display to a subset of the data.
list
list gender-read in 1/20
browse
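Both commands also accept an if condition, alone or together with in. The lines below are a small sketch using variables that appear elsewhere in this unit; the cutoff of 60 is only an illustration. Stata treats missing values as larger than any number, so the !missing() check keeps them out of the result.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* List two variables for observations with a reading score of 60 or more
list read write if read >= 60 & !missing(read)
* Combine if and in: look only at the first 20 rows, keeping those with read >= 60
list gender-read if read >= 60 & !missing(read) in 1/20
* Open the browser on the same subset
browse read write if read >= 60 & !missing(read)
```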
C) Use describe and codebook to characterize your variables, and labelbook to characterize your value labels
The describe command gives information about how each variable is stored in Stata, while codebook provides more detail, including the type of variable, its range, its most frequent values, and the number of missing values. Here we also use lookfor to find all variable names or variable labels that contain an “s”.
describe
codebook
lookfor s
The labelbook command describes all value labels used in the data.
labelbook
D) Calculate descriptive statistics of continuous variables with summarize
The basic descriptive statistics command in Stata is summarize, which reports the number of observations, mean, standard deviation, minimum, and maximum. More statistics, such as percentiles, variance, skewness, and kurtosis, are available with the detail option.
summarize
summarize read math science write   /* summarize just these variables */
display 9.48^2                      /* variance is the sd (9.48) squared */
summarize write, detail             /* more stats */
sum write if read >= 60             /* sum is abbreviation of summarize */
sum write if prgtype=="academic"
sum write in 1/40
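Rather than retyping a standard deviation from the output, you can use the results that summarize leaves behind in r(). The sketch below squares the stored standard deviation of write and compares it with the stored variance; return list shows everything that was saved.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear
summarize write
* Variance computed from the stored standard deviation
display r(sd)^2
* Variance as stored directly by summarize
display r(Var)
* Show all results saved by the last summarize
return list
```

The r() results are replaced by the next command that stores results, so use them right after summarize.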
E) Graphs for exploring continuous variables
Histograms, density plots and boxplots, created by histogram, kdensity and graph box respectively, illustrate the distribution of variables.
histogram write, normal
histogram write, normal start(30) width(5)   /* wider bins for a smoother plot */
kdensity write, normal
kdensity write, normal width(5)              /* a smoother kdensity plot */
graph box write
F) Exploring continuous variables by group
The tabstat command can calculate descriptive statistics within groups.
tabstat read write math, by(prgtype) stat(n mean sd)
tabstat write, by(prgtype) stat(n mean sd p25 p50 p75)
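The by prefix is another way to get statistics within groups, and it works with summarize and most other commands. It requires the data to be sorted by the grouping variable, which is where sort comes in. The lines below are a sketch of both forms.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* Sort by program type, then summarize write within each group
sort prgtype
by prgtype: summarize write
* bysort sorts and runs the by-group command in one step
bysort prgtype: summarize write, detail
```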
The histogram and boxplot graphed above can both be produced separately by group, using the by option or the over option, depending on the command.
histogram math, normal by(prgtype)   /* densities by prgtype */
graph box write, over(prgtype)       /* box plots by prgtype */
G) Create frequency tables of categorical variables using tabulate
The tab (short for tabulate) command can produce one-way or two-way frequency tables. The tab1 command is a convenience command to produce multiple one-way frequency tables.
tabulate ses
tab1 gender schtyp prgtype
The tab command followed by two variables produces a two-way crosstabulation. We can add the row and col options to get row and column percentages.
tab prgtype ses
tab prgtype ses, row col
H) Exploring relationships between continuous variables
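Correlation matrices describe the pairwise correlations among a set of variables, usually continuous variables. There are two commands to create correlation matrices: correlate, which uses listwise deletion of missing data (an observation is dropped if any listed variable is missing), and pwcorr, which uses pairwise deletion (each correlation uses all observations available for that pair).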
correlate write read science
pwcorr write read science, obs
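To see how the two deletion rules differ in practice, it can help to count the complete cases that correlate uses and compare that with the per-pair counts pwcorr reports. The sketch below does this and also adds p-values with the sig option and significance stars with star(); the 5% level is chosen only for illustration.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* Number of observations with no missing values on any of the three variables
* (the sample that correlate uses)
count if !missing(write, read, science)
* Pairwise correlations with per-pair observation counts and p-values
pwcorr write read science, obs sig
* Star the correlations that are significant at the 5% level
pwcorr write read science, star(.05)
```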
A scatter plot gives a quick graphical look at the relationship between two variables. We use the general-purpose graphing command twoway, which can produce plots of many types (scatter, line, bar, area, etc.), to create simple scatter plots. The graph matrix command produces a matrix of pairwise scatter plots among all variables listed; its half option draws only the lower triangle. The jitter option adds a small amount of random noise to spread apart observations that would otherwise be plotted on top of each other.
twoway (scatter write read)
twoway (scatter write read, jitter(2))
graph matrix read science write, half
I) Closing and examining the log file
We have completed all of the analyses in this unit, so it is time to close the log file.
log close
Now, let’s see what is in our log file.
view unit2.txt