*****NOTE: THERE ARE MORE COMMANDS HERE THAN DISPLAYED IN THE SEMINAR

*************** STATA *******************

* load a Stata dataset over the internet
webuse auto, clear

* change directory (not run)
* cd "C:/path/to/directory"

* histogram command
histogram weight

* comments are not executed

/* this kind of comment 
   can span
   multiple lines */
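
* in a do-file, text after // is also ignored, which allows a short note
*  on the same line as a command (the // needs a space before it)
summarize weight // summary statistics for weight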
   
* use /// to continue a command over multiple lines
summarize weight ///
  length

  
***** IMPORTING DATA

* loading Stata data files
* read from hard drive; uncomment and change path below before executing
* use "C:/path/to/myfile.dta"

* load data over internet
* notice .dta extension can be omitted
use https://stats.idre.ucla.edu/stat/data/hs0, clear

* save data, replace if it exists
save hs0, replace
* clear data from memory
clear

* load data but clear memory first
use https://stats.idre.ucla.edu/stat/data/hs0, clear

* create data frame called data2
frame create data2

* load nhanes2 data into data2 frame
frame data2: webuse nhanes2

* describe height and weight in nhanes2 data
frame data2: describe height weight

* look at data frames
frame dir
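
* a sketch of switching between frames (frames require Stata 16 or later)
* frame change makes data2 the current frame, so commands run on nhanes2
frame change data2
describe, short
* switch back to the default frame, which still holds hs0
frame change default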

* import excel file; uncomment and change path below before executing
* import excel using "C:/path/myfile.xlsx", sheet("mysheet") firstrow clear


* import csv file; uncomment and change path below before executing
* import delimited using "C:/path/myfile.csv", clear


***** HELP FILES

*open help file for command summarize
help summarize

* summary statistics for all variables
summarize
* summary statistics for just variables read and write (using abbreviated command)
summ read write
* provide additional statistics for variable read
summ read, detail
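
* summarize stores its results in r(); a sketch of reusing them
* list everything stored by the last summarize
return list
* display the mean and median of read (r(p50) is stored only with detail)
display "mean of read = " r(mean) ", median = " r(p50)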


*************** GETTING TO KNOW YOUR DATA *******************

***** VIEWING DATA

* seminar dataset
use https://stats.idre.ucla.edu/stat/data/hs0, clear

* browse dataset
browse

* list all observations and all variables
list

* list read and write for first 5 observations
li read write in 1/5

***** SELECTING OBSERVATIONS

* list science for last 3 observations
li science in -3/L

* list gender, ses, and math if math > 70 
* with clean output
li gender ses math if math > 70, clean

* browse gender, ses, and read 
*  for females (gender=2) who have read > 70
browse gender ses read if gender == 2 & read > 70
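
* count reports how many observations meet a condition
count if math > 70
* Stata treats missing values as larger than any number, so a condition
*  like science > 60 is also true when science is missing;
*  add !missing() to exclude those observations
count if science > 60
count if science > 60 & !missing(science)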


*** EXERCISE 1 ***
* 1.  Use the browse command to examine the ses values for students with write 
*     score greater than 65.
*     Then, use the help file for the browse command to rewrite the command to 
*     examine the ses values without labels.


***** EXPLORING DATA

* inspect values of variables read gender and prgtype 
codebook read gender prgtype

* summarize continuous variables
summarize read math

* summarize read and math for females
summarize read math if gender == 2

* detailed summary of read for females
summarize read if gender == 2, detail

* tabulate frequencies of ses
tabulate ses

* remove labels
tab ses, nolabel

* two-way tab of race and ses
tab race ses 

* with row percentages
tab race ses, row
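
* tabstat gives a compact table of statistics by group;
*  here the mean, sd, and n of read and math within each level of ses
tabstat read math, by(ses) statistics(mean sd n)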

*** EXERCISE 2 ***
* 2. Use the tab command to determine the numeric code for "Asians" in the race variable
*    Then use summarize to estimate the mean of the variable science for Asians


***** DATA VISUALIZATION

* histogram of write
histogram write

* histogram of write with normal density 
*  and intervals of length 5
hist write, normal width(5)

* boxplot of all test scores
graph box read write math science socst

* scatter plot of write vs read
scatter write read

* bar graphs
* bar graph of count of ses
graph bar (count), over(ses) 

* frequencies of gender by ses
*   asyvars colors bars by ses
graph bar (count), over(ses) over(gender) asyvars 

* layered graph of scatter plot and lowess curve
twoway (scatter write read) (lowess write read)

* layered scatter plots of write and read
*   colored by gender
twoway (scatter write read if gender == 1, mcolor(blue)) ///
(scatter write read if gender == 2, mcolor(red))
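
* the by() option draws a separate panel for each group instead of layering
scatter write read, by(gender)

* titles can be added with options (the title text here is only illustrative)
scatter write read, title("Writing vs. reading scores") ///
  xtitle("reading score") ytitle("writing score")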


*** EXERCISE 3 ***
* 3. Use the scatter command to create a scatter plot of math on the x-axis vs write on the y-axis
*    Use the help file for scatter to change the shape of the markers to triangles


*************** DATA MANAGEMENT *******************

***** CREATING AND TRANSFORMING VARIABLES

* generating variables
* generate a sum of 3 variables
generate total = math + science + socst

* it seems 5 missing values were generated
*  let's look at variables
summarize total math science socst

* list variables when science is missing
li math science socst if science == .

* same as above, using missing() function
li math science socst if missing(science)

* replace total with just (math+socst)
*  if science is missing
replace total = math + socst if science == .

* no missing totals now
summarize total
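
* an alternative sketch using egen: rowtotal() treats missing values as 0,
*  so it gives math + socst when science is missing
*  (total2 and nmiss are illustrative names, dropped afterwards)
egen total2 = rowtotal(math science socst)
* rowmiss() counts how many of the variables are missing in each row
egen nmiss = rowmiss(math science socst)
li math science socst total total2 nmiss if nmiss > 0
drop total2 nmiss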

* egen with function rowmean generates variable that
* is mean of all non-missing values of those variables
egen meantest = rowmean(read math science socst)
summarize meantest read math science socst

* standardize read
egen zread = std(read)
summarize zread

* renaming variables
rename gender female
* recode values to 0,1
recode female (1=0)(2=1)
tab female

* labeling variables (description)
label variable math "9th grade math score"
label variable schtyp "public/private school"

* the variable label will be used in some output
histogram math
tab schtyp

* schtyp before labeling
tab schtyp

* create and apply labels for schtyp
label define pubpri 1 public 2 private
label values schtyp pubpri
tab schtyp
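
* check which values and labels are stored in pubpri
label list pubpri
* describe shows that pubpri is attached to schtyp
describe schtyp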

* encoding string prgtype into
*  numeric variable prog
encode prgtype, gen(prog)

* we see that prog is numeric with labels (blue)
*  while the old variable prgtype is string (red)
browse prog prgtype

* we see labels by default in prog
tab prog

* use option nolabel to remove the labels
tab prog, nolabel
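
* decode goes the other way, creating a string variable from the value labels
*  (progstr is an illustrative name, dropped afterwards)
decode prog, gen(progstr)
li prgtype prog progstr in 1/5
drop progstr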


*** EXERCISE 4 ***
* 4. Use the generate and replace commands to create a variable called "highmath" 
*	  that takes on the value 1 if math is greater than 60, and 0 otherwise.
*	 Then use the label define command to create a set of value labels called "mathlabel", 
*     which labels the value 1 "high" and the value 0 "low"
*    Finally, use the label values command to apply the "mathlabel" labels to the 
*     newly generated variable "highmath".
*    Use the tab command on highmath to check your results.


***** DATASET OPERATIONS

* save dataset, overwrite existing file
save hs1, replace

* drop prgtype from dataset
drop prgtype

* keep just id read and math
keep id read math

* keep observations if read > 40
keep if read > 40
summ read

* now drop if math outside range [30,70]
drop if math < 30 | math > 70
summ math

* sorting
* first look at unsorted
li in 1/5

* now sort by read and then math
sort read math
li in 1/5

* sort descending read then ascending math
gsort -read +math
li in 1/5
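
* bysort sorts the data and then runs a command separately within groups;
*  here _N is the number of observations that share each read score
*  (nsame is an illustrative name)
bysort read: generate nsame = _N
li in 1/5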


*** EXERCISE 5 ***
* 5. Reload the hs0 data set fresh using the following command:
use https://stats.idre.ucla.edu/stat/data/hs0, clear

*    Subset the dataset to observations with write score greater than or equal to 60.  
*     Then remove all variables except for id and write.  
*     Save this as a Stata dataset called "highwrite"

*    Reload the hs0 dataset, subset to observations with write score less than 60, 
*      remove all variables except id and write, and save this dataset as "lowwrite"
use https://stats.idre.ucla.edu/stat/data/hs0, clear

*    Reload the hs0 dataset.  Drop the write variable.  Save this dataset as "nowrite".
use https://stats.idre.ucla.edu/stat/data/hs0, clear


***** APPENDING AND MERGING

* clear and reload data
use https://stats.idre.ucla.edu/stat/data/hs0, clear

*    Subset the dataset to observations with write score greater than or equal to 60.  
*     Then remove all variables except for id and write.  
*     Save this as a Stata dataset called "highwrite"
keep if write >= 60
keep id write
save highwrite, replace

*    Reload the hs0 dataset, subset to observations with write score less than 60, 
*      remove all variables except id and write, and save this dataset as "lowwrite"
use https://stats.idre.ucla.edu/stat/data/hs0, clear
keep if write < 60
keep id write
save lowwrite, replace

*    Reload the hs0 dataset.  Drop the write variable.  Save this dataset as "nowrite".
use https://stats.idre.ucla.edu/stat/data/hs0, clear
drop write
save nowrite, replace

* append highwrite and lowwrite datasets
* first load highwrite 
use highwrite, clear

* append lowwrite
append using lowwrite
* summarize write shows 200 observations and write scores above and below 60
summ write

* merge in nowrite dataset using id to link
merge 1:1 id using nowrite
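
* merge creates the variable _merge to record where each observation came from:
*  1 = only in the data in memory, 2 = only in the using file, 3 = in both
tab _merge
* drop _merge before running another merge, which would try to create it again
drop _merge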


*************** BASIC STATISTICAL ANALYSIS *******************
* load new dataset
use https://stats.idre.ucla.edu/stat/data/hs1, clear

** ANALYSIS OF CONTINUOUS OUTCOMES
* many commands provide 95% CI
mean read

* testing if means are different 
*  across groups
* independent samples t-test
ttest read, by(female)
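
* the t-test above assumes equal variances in the two groups;
*  the unequal option relaxes that assumption
ttest read, by(female) unequal

* paired t-test comparing two scores measured on the same students
ttest read == write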

* correlation matrix of 5 variables
corr read write math science socst

* linear regression of write on continuous
*  predictor math and categorical predictor ses
regress write c.math i.ses

* look at what postestimation commands are available after regress
help regress postestimation

* postestimation examples
* predicted dependent variable
predict pred

* get residuals
predict res, residuals

* first 5 predicted values and residuals with observed write
li pred res write in 1/5
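
* two more postestimation sketches; both use the regression of write above
* joint test that all ses coefficients are 0
testparm i.ses
* predicted mean of write at each level of ses, averaging over math
margins ses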


*** EXERCISE 6 ***
* 6. Use the regress command to determine if the variables female (categorical) and
*      science (continuous) are predictive of the dependent variable math.
*    One of the assumptions of linear regression is that the errors 
*     (estimated by residuals) are normally distributed.  Use the predict command
*     and the histogram command to assess this assumption.


** ANALYSIS OF CATEGORICAL OUTCOMES

* chi square test of independence
tab prog ses, chi2

* uncomment and run these lines if highmath is not in your dataset
* gen highmath = 0
* replace highmath = 1 if math > 60
* replace highmath = . if missing(math)

* logistic regression of binary outcome highmath predicted by
*   write (continuous) and female (categorical)
* the or option reports odds ratios
logit highmath c.write i.female, or
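
* after logit, margins reports predicted probabilities by default;
*  here the average predicted probability of highmath for each level of female
margins female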


*************** OUTPUTTING TO WORD AND EXCEL *******************

** OUTPUTTING TO WORD

* create a Word docx file to export results, set default font size to 12
putdocx begin, font(12)

* first line starts a block of text in a new paragraph, using a Title style
* second line is the text to appear
* third line ends the text block
putdocx textblock begin, style(Title)
Regression of read results
putdocx textblock end

* start a block of text in a new paragraph
putdocx textblock begin
This report displays the results of the regression of read on write and math and a residuals vs fitted plot to check some of the assumptions of the linear regression model.
putdocx textblock end

* run regression of read on write and math
regress read write math
* insert regression as a Word table
putdocx table model1 = etable

* make residuals vs fitted plot
rvfplot, yline(0)
* export plot to .png file
graph export rvf.png, replace
* start a new paragraph to add spacing
putdocx paragraph
* insert .png file
putdocx image rvf.png

* save and close the .docx file, replace file if it already exists
putdocx save regress_read.docx, replace


** OUTPUTTING TO EXCEL

* open Excel file for output, replace if it exists already
putexcel set "output.xlsx", sheet("regress results") replace

* regression of read on write and female
regress read write female
* write bolded text to cell A1 in Excel file
putexcel A1 = "Regression of read", bold
* write regression table to cell A2 in Excel file
putexcel A2 = etable

* make a boxplot of read by female
graph box read, over(female)
* export plot to .png file
graph export boxread.png, replace
* add plot to cell A7 of Excel file
putexcel A7 = image("boxread.png")

* close and save Excel file
putexcel save

****************** SOLUTION TO IN-CLASS EXERCISES *****************************

* LOAD DATA FOR EXERCISES
use https://stats.idre.ucla.edu/stat/data/hs0, clear

*** EXERCISE 1 ***
* 1.  Use the browse command to examine the ses values and write scores  
*     for students with write score greater than 65.
*     Then, use the help file for the browse command to rewrite the command to 
*     examine the ses values without labels.
browse ses write if write > 65
help browse
browse ses write if write > 65, nolabel

*** EXERCISE 2 ***
* 2. Use the tab command to determine the numeric code for "Asians" in the race variable
*    Then use summarize to estimate the mean of the variable science for Asians
tab race
tab race, nolabel
summ science if race == 2

*** EXERCISE 3 ***
* 3. Use the scatter command to create a scatter plot of math on the x-axis vs write on the y-axis
*    Use the help file for scatter to change the shape of the markers to triangles
scatter write math
scatter write math, msymbol(triangle)

*** EXERCISE 4 ***
* 4. Use the generate and replace commands to create a variable called "highmath" 
*	  that takes on the value 1 if math is greater than 60, and 0 otherwise.
*    Then use the label define command to create a set of value labels, which you can
*     call "mathlabel", that labels the value 1 as "high" and the value 0 as "low".
*    Finally, use the label values command to apply the "mathlabel" labels to the 
*     newly generated variable "highmath".
*    Use the tab command on highmath to check your results.
gen highmath = 0
replace highmath = 1 if math > 60
replace highmath = . if math == .

label define mathlabel 0 "low" 1 "high"
label values highmath mathlabel
tab highmath
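
* a shorter sketch of the same coding: the expression math > 60 is 1 when true
*  and 0 when false, and the if qualifier leaves missing math as missing
*  (highmath2 is an illustrative name)
gen highmath2 = math > 60 if !missing(math)
tab highmath highmath2, missing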

*** EXERCISE 5 ***
* 5. Reload the hs0 data set fresh using the following command:
use https://stats.idre.ucla.edu/stat/data/hs0, clear

*    Subset the dataset to observations with write score greater than or equal to 60.  
*     Then remove all variables except for id and write.  
*     Save this as a Stata dataset called "highwrite"
keep if write >= 60
keep id write
save highwrite, replace

*    Reload the hs0 dataset, subset to observations with write score less than 60, 
*      remove all variables except id and write, and save this dataset as "lowwrite"
use https://stats.idre.ucla.edu/stat/data/hs0, clear
keep if write < 60
keep id write
save lowwrite, replace

*    Reload the hs0 dataset.  Drop the write variable.  Save this dataset as "nowrite".
use https://stats.idre.ucla.edu/stat/data/hs0, clear
drop write
save nowrite, replace

*** EXERCISE 6 ***
* load hs1 dataset
use https://stats.idre.ucla.edu/stat/data/hs1, clear

* 6. Use the regress command to determine if the variables female (categorical) and
*      science (continuous) are predictive of the dependent variable math.
*    One of the assumptions of linear regression is that the errors 
*     (estimated by residuals) are normally distributed.  Use the predict command
*     and the histogram command to assess this assumption.
regress math i.female c.science
predict mathres, residuals
histogram mathres, normal
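
* a normal quantile plot is another way to look at the same assumption;
*  points close to the diagonal line suggest approximately normal residuals
qnorm mathres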