*****NOTE: THERE ARE MORE COMMANDS HERE THAN DISPLAYED IN THE SEMINAR
*************** STATA *******************
* load a Stata dataset over the internet
webuse auto, clear
* change directory (not run)
* cd "C:/path/to/directory"
* histogram command
histogram weight
* comments are not executed
/* this kind of comment
can span
multiple lines */
* use /// to continue a command over multiple lines
summarize weight ///
length
***** IMPORTING DATA
* loading Stata data files
* read from hard drive; uncomment and change path below before executing
* use "C:/path/to/myfile.dta"
* load data over internet
* notice .dta extension can be omitted
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* save data, replace if it exists
save hs0, replace
* clear data from memory
clear
* load data but clear memory first
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* create data frame called data2
frame create data2
* load nhanes2 data into data2 frame
frame data2: webuse nhanes2
* describe height and weight in nhanes2 data
frame data2: describe height weight
* look at data frames
frame dir
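* a possible next step (an illustration, not part of the seminar):
* make data2 the working frame, run a command there,
* then switch back to the default frame
frame change data2
summarize height weight
frame change default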
* import excel file; uncomment and change path below before executing
* import excel using "C:\path\myfile.xlsx", sheet("mysheet") firstrow clear
* import csv file; uncomment and change path below before executing
* import delimited using "C:\path\myfile.csv", clear
***** HELP FILES
* open help file for command summarize
help summarize
* summary statistics for all variables
summarize
* summary statistics for just variables read and write (using abbreviated command)
summ read write
* provide additional statistics for variable read
summ read, detail
*************** GETTING TO KNOW YOUR DATA *******************
***** VIEWING DATA
* seminar dataset
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* browse dataset
browse
* list all observations and all variables
list
* list read and write for first 5 observations
li read write in 1/5
***** SELECTING OBSERVATIONS
* list science for last 3 observations
li science in -3/L
* list gender, ses, and math if math > 70
* with clean output
li gender ses math if math > 70, clean
* browse gender, ses, and read
* for females (gender=2) who have read > 70
browse gender ses read if gender == 2 & read > 70
*** EXERCISE 1 ***
* 1. Use the browse command to examine the ses values for students with write
* score greater than 65.
* Then, use the help file for the browse command to rewrite the command to
* examine the ses values without labels.
***** EXPLORING DATA
* inspect values of variables read gender and prgtype
codebook read gender prgtype
* summarize continuous variables
summarize read math
* summarize read and math for females
summarize read math if gender == 2
* detailed summary of read for females
summarize read if gender == 2, detail
* tabulate frequencies of ses
tabulate ses
* remove labels
tab ses, nolabel
* two-way tab of race and ses
tab race ses
* with row percentages
tab race ses, row
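* one way to compare a continuous variable across categories
* (an addition to the seminar commands, shown as a sketch):
* mean, standard deviation, and n of read within each level of ses
tabstat read, by(ses) statistics(mean sd n)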
*** EXERCISE 2 ***
* 2. Use the tab command to determine the numeric code for "Asians" in the race variable
* Then use summarize to estimate the mean of the variable science for Asians
***** DATA VISUALIZATION
* histogram of write
histogram write
* histogram of write with normal density
* and intervals of length 5
hist write, normal width(5)
* boxplot of all test scores
graph box read write math science socst
* scatter plot of write vs read
scatter write read
* bar graphs
* bar graph of count of ses
graph bar (count), over(ses)
* frequencies of gender by ses
* asyvars colors bars by ses
graph bar (count), over(ses) over(gender) asyvars
* layered graph of scatter plot and lowess curve
twoway (scatter write read) (lowess write read)
* layered scatter plots of write and read
* colored by gender
twoway (scatter write read if gender == 1, mcolor(blue)) ///
(scatter write read if gender == 2, mcolor(red))
*** EXERCISE 3 ***
* 3. Use the scatter command to create a scatter plot of math on the x-axis vs write on the y-axis
* Use the help file for scatter to change the shape of the markers to triangles
*************** DATA MANAGEMENT *******************
***** CREATING AND TRANSFORMING VARIABLES
* generating variables
* generate a sum of 3 variables
generate total = math + science + socst
* it seems 5 missing values were generated
* let's look at variables
summarize total math science socst
* list variables when science is missing
li math science socst if science == .
* same as above, using missing() function
li math science socst if missing(science)
* replace total with just (math+socst)
* if science is missing
replace total = math + socst if science == .
* no missing totals now
summarize total
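* a side note, sketched here as an illustration: Stata treats missing (.)
* as larger than any number, so a condition like science > 60 is also
* true for observations where science is missing
count if science > 60
* adding !missing() counts only observed science scores above 60
count if science > 60 & !missing(science)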
* egen with function rowmean generates variable that
* is mean of all non-missing values of those variables
egen meantest = rowmean(read math science socst)
summarize meantest read math science socst
* standardize read
egen zread = std(read)
summarize zread
* renaming variables
rename gender female
* recode values to 0,1
recode female (1=0)(2=1)
tab female
* labeling variables (description)
label variable math "9th grade math score"
label variable schtyp "public/private school"
* the variable label will be used in some output
histogram math
tab schtyp
* schtyp before labeling
tab schtyp
* create and apply labels for schtyp
label define pubpri 1 public 2 private
label values schtyp pubpri
tab schtyp
* encoding string prgtype into
* numeric variable prog
encode prgtype, gen(prog)
* we see that prog is numeric with labels (blue)
* while the old variable prgtype is string (red)
browse prog prgtype
* we see labels by default in prog
tab prog
* use option nolabel to remove the labels
tab prog, nolabel
*** EXERCISE 4 ***
* 4. Use the generate and replace commands to create a variable called "highmath"
* that takes on the value 1 if math is greater than 60, and 0 otherwise.
* Then use the label define command to create a set of value labels called "mathlabel",
* which labels the value 1 "high" and the value 0 "low"
* Finally, use the label values command to apply the "mathlabel" labels to the
* newly generated variable "highmath".
* Use the tab command on highmath to check your results.
***** DATASET OPERATIONS
* save dataset, overwrite existing file
save hs1, replace
* drop prgtype from dataset
drop prgtype
* keep just id read and math
keep id read math
* keep observation if read > 40
keep if read > 40
summ read
* now drop if math outside range [30,70]
drop if math < 30 | math > 70
summ math
* sorting
* first look at unsorted
li in 1/5
* now sort by read and then math
sort read math
li in 1/5
* sort descending read then ascending math
gsort -read +math
li in 1/5
*** EXERCISE 5 ***
* 5. Reload the hs0 data set fresh using the following command:
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* Subset the dataset to observations with write score greater than or equal to 60.
* Then remove all variables except for id and write.
* Save this as a Stata dataset called "highwrite"
* Reload the hs0 dataset, subset to observations with write score less than 60,
* remove all variables except id and write, and save this dataset as "lowwrite"
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* Reload the hs0 dataset. Drop the write variable. Save this dataset as "nowrite".
use https://stats.idre.ucla.edu/stat/data/hs0, clear
***** APPENDING AND MERGING
* clear and reload data
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* Subset the dataset to observations with write score greater than or equal to 60.
* Then remove all variables except for id and write.
* Save this as a Stata dataset called "highwrite"
keep if write >= 60
keep id write
save highwrite, replace
* Reload the hs0 dataset, subset to observations with write score less than 60,
* remove all variables except id and write, and save this dataset as "lowwrite"
use https://stats.idre.ucla.edu/stat/data/hs0, clear
keep if write < 60
keep id write
save lowwrite, replace
* Reload the hs0 dataset. Drop the write variable. Save this dataset as "nowrite".
use https://stats.idre.ucla.edu/stat/data/hs0, clear
drop write
save nowrite, replace
* append highwrite and lowwrite datasets
* first load highwrite
use highwrite, clear
* append lowwrite
append using lowwrite
* summarize write shows 200 observations and write scores above and below 60
summ write
* merge in nowrite dataset using id to link
merge 1:1 id using nowrite
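* merge adds a variable named _merge that records the match status of
* each observation (3 = found in both datasets); tabulating it is a
* quick check that the merge behaved as expected
tab _merge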
*************** BASIC STATISTICAL ANALYSIS *******************
* load new dataset
use https://stats.idre.ucla.edu/stat/data/hs1, clear
** ANALYSIS OF CONTINUOUS OUTCOMES
* many commands provide 95% CI
mean read
* testing if means are different
* across groups
* independent samples t-test
ttest read, by(female)
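* the t-test above assumes equal variances in the two groups;
* as an optional variation (not in the seminar), the unequal option
* relaxes that assumption
ttest read, by(female) unequal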
* correlation matrix of 5 variables
corr read write math science socst
* linear regression of write on continuous
* predictor math and categorical predictor ses
regress write c.math i.ses
* look at what postestimation commands are available after regress
help regress postestimation
* postestimation examples
* predicted dependent variable
predict pred
* get residuals
predict res, residuals
* first 5 predicted values and residuals with observed write
li pred res write in 1/5
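* two more postestimation commands that may be useful after this
* regression (shown as a sketch, not part of the seminar):
* joint test that all ses indicator coefficients are zero
testparm i.ses
* adjusted predictions of write at each level of ses
margins ses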
*** EXERCISE 6 ***
* 6. Use the regress command to determine if the variables female (categorical) and
* science (continuous) are predictive of the dependent variable math.
* One of the assumptions of linear regression is that the errors
* (estimated by residuals) are normally distributed. Use the predict command
* and the histogram command to assess this assumption.
** ANALYSIS OF CATEGORICAL OUTCOMES
* chi square test of independence
tab prog ses, chi2
* uncomment and run these lines if highmath is not in your dataset
* gen highmath = 0
* replace highmath = 1 if math > 60
* logistic regression of binary outcome highmath predicted by
* write (continuous) and female (categorical)
logit highmath c.write i.female, or
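* a possible follow-up (not in the seminar): average predicted
* probability of highmath = 1 at each level of female
margins female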
*************** OUTPUTTING TO WORD AND EXCEL *******************
** OUTPUTTING TO WORD
* create a Word docx file to export results, set default font size to 12
putdocx begin, font(12)
* first line starts a block of text in a new paragraph, using a Title style
* second line is the text to appear
* third line ends the text block
putdocx textblock begin, style(Title)
Regression of read results
putdocx textblock end
* start a block of text in a new paragraph
putdocx textblock begin
This report displays the results of the regression of read on write and math and a residuals vs fitted plot to check some of the assumptions of the linear regression model.
putdocx textblock end
* run regression of read on write and math
regress read write math
* insert regression as a Word table
putdocx table model1 = etable
* make residuals vs fitted plot
rvfplot, yline(0)
* export plot to .png file
graph export rvf.png, replace
* start a new paragraph to add spacing
putdocx paragraph
* insert .png file
putdocx image rvf.png
* save and close the .docx file, replace file if it already exists
putdocx save regress_read.docx, replace
** OUTPUTTING TO EXCEL
* open Excel file for output, replace if it exists already
putexcel set "output.xlsx", sheet("regress results") replace
* regression of read on write and female
regress read write female
* write bolded text to cell A1 in Excel file
putexcel A1 = "Regression of read", bold
* write regression table to cell A2 in Excel file
putexcel A2 = etable
* make a boxplot of read by female
graph box read, over(female)
* export plot to .png file
graph export boxread.png, replace
* add plot to cell A7 of Excel file
putexcel A7 = image("boxread.png")
* close and save Excel file
putexcel save
****************** SOLUTION TO IN-CLASS EXERCISES *****************************
* LOAD DATA FOR EXERCISES
use https://stats.idre.ucla.edu/stat/data/hs0, clear
*** EXERCISE 1 ***
* 1. Use the browse command to examine the ses values and write scores
* for students with write score greater than 65.
* Then, use the help file for the browse command to rewrite the command to
* examine the ses values without labels.
browse ses write if write > 65
help browse
browse ses write if write > 65, nolabel
*** EXERCISE 2 ***
* 2. Use the tab command to determine the numeric code for "Asians" in the race variable
* Then use summarize to estimate the mean of the variable science for Asians
tab race
tab race, nolabel
summ science if race == 2
*** EXERCISE 3 ***
* 3. Use the scatter command to create a scatter plot of math on the x-axis vs write on the y-axis
* Use the help file for scatter to change the shape of the markers to triangles
scatter write math
scatter write math, msymbol(triangle)
*** EXERCISE 4 ***
* 4. Use the generate and replace commands to create a variable called "highmath"
* that takes on the value 1 if math is greater than 60, and 0 otherwise.
* Then use the label define command to create a set of value labels, which you can
* call "mathlabel", which labels the value 1 as "high" and the value 0 as "low".
* Finally, use the label values command to apply the "mathlabel" labels to the
* newly generated variable "highmath".
* Use the tab command on highmath to check your results.
gen highmath = 0
replace highmath = 1 if math > 60
replace highmath = . if math == .
label define mathlabel 0 "low" 1 "high"
label values highmath mathlabel
tab highmath
*** EXERCISE 5 ***
* 5. Reload the hs0 data set fresh using the following command:
use https://stats.idre.ucla.edu/stat/data/hs0, clear
* Subset the dataset to observations with write score greater than or equal to 60.
* Then remove all variables except for id and write.
* Save this as a Stata dataset called "highwrite"
keep if write >= 60
keep id write
save highwrite, replace
* Reload the hs0 dataset, subset to observations with write score less than 60,
* remove all variables except id and write, and save this dataset as "lowwrite"
use https://stats.idre.ucla.edu/stat/data/hs0, clear
keep if write < 60
keep id write
save lowwrite, replace
* Reload the hs0 dataset. Drop the write variable. Save this dataset as "nowrite".
use https://stats.idre.ucla.edu/stat/data/hs0, clear
drop write
save nowrite, replace
*** EXERCISE 6 ***
* load hs1 dataset
use hs1, clear
* 6. Use the regress command to determine if the variables female (categorical) and
* science (continuous) are predictive of the dependent variable math.
* One of the assumptions of linear regression is that the errors
* (estimated by residuals) are normally distributed. Use the predict command
* and the histogram command to assess this assumption.
regress math i.female c.science
predict mathres, residuals
histogram mathres, normal