| Command | Description |
|---|---|
| use | Load dataset into memory |
| describe | Describe a dataset |
| list | List the contents of a dataset |
| codebook | Detailed contents of a dataset |
| labelbook | Information on value labels |
| log | Create a log file |
| lookfor | Find variables in large dataset |
| summarize | Descriptive statistics |
| tabstat | Table of descriptive statistics |
| graph | High resolution graphs |
| kdensity | Kernel density plot |
| sort | Sort observations in a dataset |
| histogram | Histogram for continuous and categorical variables |
| tabulate | One- and two-way frequency tables |
| correlate | Correlations |
| pwcorr | Pairwise correlations |
| view | Display file in viewer window |

#### 2.0 Demonstration and explanation

In this section, we demonstrate commands that give you a quick first look at your data. We begin by loading **hs0.dta** with the **use** command.

use https://stats.idre.ucla.edu/stat/data/hs0, clear

#### A) Keep a record of your work with **log**

Next, we open a log file, which saves all of the commands and their output
(except for graphs) in a text file. We use the **text** option so
that the log can be read in any text editor, such as Notepad or WordPad. The **replace** option overwrites any existing file with the same name.

log using unit2.txt, text replace
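If a log is still open from an earlier session, **log using** stops with an error. A common defensive pattern, offered here as a suggestion rather than as part of the original unit, is to close any open log first. The **capture** prefix suppresses the error Stata would otherwise raise when no log is open, and **log query** reports the current log status.

```stata
* Close a log left open by an earlier session; capture ignores the
* error that log close raises when no log is open
capture log close

* Start the log for this unit (same file name as above)
log using unit2.txt, text replace

* Report which log file is open and whether it is on
log query
```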

#### B) Use **list** and **browse** to display your data

The command **list** prints your data to the Results window, while **browse** opens the Data Editor in read-only browse mode. You can supply variable names and observation numbers to both **list** and **browse** to restrict the display to a subset of the data.

list
list gender-read in 1/20
browse
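Both commands also accept an **if** condition. The sketch below assumes the **hs0** data loaded above and uses the **clean** and **noobs** options of **list** for a more compact layout. The cutoff of 60 is only an illustration.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear

* Show selected variables only for students with a reading score of 60 or more
list gender read write if read >= 60 & !missing(read)

* Combine a condition with an observation range; clean and noobs
* drop the table borders and the observation numbers
list gender read write if read >= 60 & !missing(read) in 1/20, clean noobs

* Count how many observations satisfy the condition
count if read >= 60 & !missing(read)
```

Stata treats numeric missing values as larger than any number, so `read >= 60` on its own would also select observations where **read** is missing. The `!missing(read)` term excludes them.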

#### C) Use **describe** and **codebook** to characterize your variables, and **labelbook** to characterize your value labels

The **describe** command gives information about how each variable is stored in Stata, while **codebook** provides more detailed information, including the type of variable, range, frequent values, and number of missing values. Here we also use **lookfor** to find all variables whose name or variable label contains an “s”.

describe
codebook
lookfor s

The **labelbook** command describes all value labels used in the data.

labelbook

#### D) Calculate descriptive statistics of continuous variables with **summarize**

The basic descriptive statistics command in Stata is **summarize**, which reports the number of observations, mean, standard deviation, minimum, and maximum of each variable. The **detail** option adds percentiles, variance, skewness, and kurtosis.

summarize
summarize read math science write /* summarize just these variables */
display 9.48^2 /* variance is the sd (9.48) squared */
summarize write, detail /* more stats */
sum write if read >= 60 /* sum is abbreviation of summarize */
sum write if prgtype=="academic"
sum write in 1/40
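Rather than retyping a standard deviation from the output, you can use the results that **summarize** leaves behind in `r()`. The following is a minimal sketch, assuming the **hs0** data are loaded. `r(sd)` and `r(Var)` are stored results of **summarize**, and **return list** shows everything that is available.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear

* summarize leaves its results in r(); return list shows what is stored
summarize write
return list

* Variance computed from the stored standard deviation, no retyping needed
display r(sd)^2

* The same quantity is also stored directly as r(Var)
display r(Var)
```

The stored results are replaced the next time an r-class command runs, so use them right after the **summarize** call they belong to.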

#### E) Graphs for exploring continuous variables

Histograms, density plots and boxplots, created by **histogram**, **kdensity** and **graph box** respectively, illustrate the distribution of variables.

histogram write, normal
histogram write, normal start(30) width(5) /* wider bins for a smoother plot */
kdensity write, normal
kdensity write, normal width(5) /* a smoother kdensity plot */
graph box write

#### F) Exploring continuous variables by group

The **tabstat** command can calculate descriptive statistics within groups.

tabstat read write math, by(prgtype) stat(n mean sd)
tabstat write, by(prgtype) stat(n mean sd p25 p50 p75)
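An alternative way to get statistics by group, sketched here for comparison, is the **bysort** prefix, which repeats a command once for each value of the grouping variable. The second command shows the **columns(statistics)** option of **tabstat**, which places the statistics in columns. Both assume the **hs0** data are loaded.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear

* Run summarize once per program type; bysort sorts the data by prgtype first
bysort prgtype: summarize write

* One variable, statistics laid out as columns instead of rows
tabstat write, by(prgtype) stat(n mean sd min max) columns(statistics)
```

Note that **bysort** changes the sort order of the data in memory, whereas **tabstat** with **by()** does not require the data to be sorted.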

The histogram and boxplot graphed above can both be produced separately by group, using the **by** option or the **over** option, depending on the command.

histogram math, normal by(prgtype) /* densities by prgtype */
graph box write, over(prgtype) /* box plots by prgtype */

#### G) Create frequency tables of categorical variables using **tabulate**

The **tab** (short for **tabulate**) command can produce one-way or two-way frequency tables. The **tab1**
command is a convenience command to produce multiple one-way frequency tables.

tabulate ses
tab1 gender schtyp prgtype

The **tab** command followed by two variables will produce a two-way crosstabulation. We can add the **row** and **col** options to get row and column percentages.

tab prgtype ses
tab prgtype ses, row col
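Two further options of **tabulate** are often useful with two-way tables, shown here as an optional extension rather than as part of the original unit: **chi2** adds a Pearson chi-squared test of independence, and **missing** treats missing values as a category of their own.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear

* Two-way table with a Pearson chi-squared test of independence
tab prgtype ses, chi2

* Treat missing values as their own category in the table
tab prgtype ses, missing
```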

#### H) Exploring relationships between continuous variables

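Correlation matrices describe the pairwise correlations among a set of variables, usually continuous variables. Two commands create correlation matrices: **correlate**, which uses listwise deletion (an observation is dropped if it is missing on any of the listed variables), and **pwcorr**, which uses pairwise deletion (each correlation uses every observation that has data on that pair of variables). The **obs** option of **pwcorr** displays the number of observations used for each correlation.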

correlate write read science
pwcorr write read science, obs
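To see how the two deletion rules differ in practice, it helps to check how much data is missing and how many observations each command uses. This is a minimal sketch assuming the **hs0** data are loaded. **misstable summarize** reports missing-value counts, `r(N)` is the number of observations **correlate** used, and the **sig** option of **pwcorr** adds p-values.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear

* How many values are missing in each variable?
misstable summarize write read science

* Listwise: every correlation is based on the same complete cases
correlate write read science
display r(N)

* Pairwise: each correlation uses all observations available for that pair;
* obs prints the counts and sig prints the p-values
pwcorr write read science, obs sig
```

If none of the variables has missing values, the two commands produce the same correlations.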

A scatter plot gives a quick graphical look at the relationship between two variables. We use the general-purpose graphing command **twoway**, which can produce plots of many types (scatter, line, bar, area, etc.), to create simple scatter plots. The **graph matrix** command produces a matrix of pairwise scatter plots among all variables listed; its **half** option draws only the lower triangle. The **jitter** option adds a small amount of random noise to the points so that identical observations do not overlap.

twoway (scatter write read)
twoway (scatter write read, jitter(2))
graph matrix read science write, half
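Because **twoway** accepts several plots in one command, a fitted line can be overlaid on the scatter plot. The sketch below is an optional extension that assumes the **hs0** data are loaded. **lfit** is the **twoway** plot type that draws the linear prediction of the first variable from the second.

```stata
use https://stats.idre.ucla.edu/stat/data/hs0, clear

* Overlay a linear fit line on the scatter plot
twoway (scatter write read) (lfit write read)

* Same plot with jittered points so overlapping observations are visible
twoway (scatter write read, jitter(2)) (lfit write read)
```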

#### I) Closing and examining the log file

We have completed all of the analyses in this unit, so it is time to close the log file.

log close

Now, let’s see what is in our log file.

view unit2.txt

#### 3.0 For more information

**Statistics with Stata 12**

**Gentle Introduction to Stata, Revised Third Edition**

**Data Analysis Using Stata, Third Edition**

**An Introduction to Stata for Health Researchers, Third Edition**

**Stata Learning Modules**