Many researchers use Stata without ever writing a program even though programming could make them more efficient in their data analysis projects. Stata programming is not difficult since it mainly involves the use of Stata commands that you already use. The trick to Stata programming is to use the appropriate commands in the right sequence. Of course, this is the trick to any kind of programming.
There are two kinds of files that are used in Stata programming, do-files and ado-files. Do-files are run from the command line using the do command, for example,
do hsbcheck hsberr
Ado-files, on the other hand, work like ordinary Stata commands by just using the file name in the command line, for example,
ttest write, by(female)
In fact, many of the built-in Stata commands are just ado-files, like the ttest command shown above. You can look at the source code for the ado commands using the viewsource command, for example,
viewsource ttest.ado
You can also use viewsource with your do-files.
viewsource hsbcheck.do
Do-files can be placed in the same folder as the data, but ado-files need to go where Stata can find them. The best place for user-written ado-files is the /ado/personal/ directory. The location of this directory can vary from system to system.
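Because the personal directory differs across machines, it can help to ask Stata where it is. The following is a minimal sketch using built-in commands; the paths it prints depend on the installation.

```stata
* list the system directories Stata knows about, including PERSONAL
sysdir list

* show the full search path Stata uses when it looks for ado-files
adopath

* the personal directory is also available as a c() value
display "`c(sysdir_personal)'"
```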
We will try to give you a feel for Stata programming by covering the following topics:
- Creating and using do-files for checking and cleaning data.
- Using do-files for analyzing data.
- Writing an ado program to create a statistical command.
- Using Stata and Mata matrix commands.
Dirty Data
We will create a do-file, hsbcheck.do, that contains commands that will display observations with incorrect or impossible values.
/* begin hsbcheck.do */
version 11.0
clear
use `1', clear
set more off
nmissing    // user-written command; must be installed separately
describe
summarize
* range checks: the lower bounds (1, or 0 for female) are assumed values
list id if id<1 | id>200
list id gender if ~ inlist(gender,1,2,.)
list id race if ~ inlist(race,1,2,3,4,.)
list id ses if ~ inlist(ses,1,2,3,.)
list id schtyp if ~ inlist(schtyp,1,2,.)
list id prog if ~ inlist(prog,1,2,3,.)
list id read if (read<1 | read>99) & read~=.
list id write if (write<1 | write>99) & write~=.
list id math if (math<1 | math>99) & math~=.
list id science if (science<1 | science>99) & science~=.
list id socst if (socst<1 | socst>99) & socst~=.
*list id female if (female<0 | female>1) & female~=.
set more on
/* end hsbcheck.do */
Here is how our do-file is used with the dataset hsberr.
do hsbcheck hsberr
[output omitted]
So how did the hsbcheck program “know” which file to use? This was done using a macro variable, in this case `1', which holds the first term typed after the name of the do-file; the use command then treats that term as a file name. Macro variables have many uses, including as variable names or numeric values. We will see additional uses of macro variables in other programs.
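Positional macros work the same way inside a program as inside a do-file. The sketch below uses a hypothetical program, show_args, that is not part of the seminar files; it only echoes what it receives.

```stata
capture program drop show_args
program define show_args, rclass
    // `1' and `2' hold the first and second words typed after the command
    display "first argument:  `1'"
    display "second argument: `2'"
    // `0' holds everything typed after the command name
    display "everything:      `0'"
    return local first "`1'"
end

show_args hsberr extra
```

Here `1' expands to hsberr, just as it does when we type do hsbcheck hsberr.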
Now that we know what errors there are in the data we can write a do-file that will fix the errors. When we know the correct value of an observation, we will replace the incorrect value with the correct one. When we do not know the correct value for an observation, we will replace the incorrect value with missing. The do-file hsbfix.do will read in hsberr, correct the errors and save the corrected file as hsbclean. Here is what hsbfix.do looks like.
/* begin hsbfix.do */
use hsberr, clear
replace id=193 if id==1193
replace read=47 if read==147
replace science=61 if science==-61
replace gender=. if gender==5
* out-of-range codes become missing; the lower bound of 1 is an assumed value
replace race=. if race<1 | race>4
replace ses=. if ses<1 | ses>3
replace schtyp=. if schtyp<1 | schtyp>2
replace prog=. if prog<1 | prog>3
tab gender
tab race
/* create female from gender */
generate female = gender
recode female 1=0 2=1
label define fem 0 "male" 1 "female"
label value female fem
tab female gender
label data "hsb clean data using hsberr.do"
save hsbclean, replace
/* end hsbfix.do */
One important thing to note is that after we fix the incorrect values, we will save the data file with a new name. We will never change any of the values in the original data file, hsberr.
First, we will run hsbfix on the original file hsberr then, as a check, we will run hsbcheck on the new file hsbclean.
do hsbfix
do hsbcheck hsbclean
[output omitted]
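Listing bad values relies on someone reading the output. A check can also be made self-enforcing with count and assert. The sketch below uses the auto dataset that ships with Stata as a stand-in for the hsb files, so the variable names are assumptions for illustration only.

```stata
sysuse auto, clear

* count leaves the number of offending observations in r(N)
quietly count if !inlist(foreign, 0, 1) & !missing(foreign)
display "observations with an invalid foreign code: " r(N)

* assert stops a do-file with an error if any observation violates the rule
assert inlist(foreign, 0, 1) | missing(foreign)
assert rep78 >= 1 & rep78 <= 5 if !missing(rep78)
```

If every assert passes, the do-file continues silently; a failure halts it at the offending line.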
Analyze This!
Next, we will create a do-file that contains all of the commands that we need to run our data analysis. This do-file will be called hsbanalyze.do.
/* begin hsbanalyze.do */
log using hsb_fall_2011.txt, text replace
summarize read write socst
univar read write math science
tabstat write, stat(n mean sd p25 p50 p75) by(female)
tabstat write, stat(n mean sd p25 p50 p75) by(prog)
ttest write, by(female)
hist write, normal start(30) width(5) name(hist_write, replace)
kdensity write, normal name(kd_write, replace)
tab1 female ses prog
tab prog ses, all
correlate write read socst female
graph matrix read socst write, half name(grmat, replace)
regress write read i.female##c.socst
margins female, at(socst=(25(5)70)) atmeans asbalanced
marginsplot, recast(line) recastci(rarea) name(plot, replace)
log close
/* end hsbanalyze */
Now, let’s use hsbanalyze with our data file hsbclean.
use hsbdemo, clear
do hsbanalyze
[output omitted]
This may not seem all that useful; after all, you could just as easily type each of the commands into the command window, but what if your coauthor comes to you and says, “we need to redo the whole analysis using only schtyp equal to one.” Here’s all you have to do.
keep if schtyp==1
do hsbanalyze
[output omitted]
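One caveat with keep is that the dropped observations are gone from memory until the data are reloaded. A possible alternative, sketched here with the auto dataset that ships with Stata (an assumption, standing in for the hsb data), wraps the subset analysis in preserve and restore.

```stata
sysuse auto, clear

preserve                      // take a snapshot of the data in memory
keep if foreign == 1          // restrict to the subsample
summarize price mpg           // stand-in for: do hsbanalyze
restore                       // bring back the full dataset

count                         // all observations are available again
```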
Many Happy Returns
Return lists are one of the most powerful and useful features of Stata. There are three commonly used return lists: 1) return list for ordinary nonestimation commands; 2) ereturn list for full estimation commands; and 3) creturn list for a list of constants and system parameters.
Let’s start with the creturn list (abbr: cret lis).
cret lis
You can access any of these values using c(name), replacing name with the name of the item. For example, to get today’s date and time,
display "Today is " c(current_date) " and the current time is " c(current_time)
This can also be useful for appending a date onto a file name.
local today = subinstr(c(current_date), " ", "_", .)
save hsb`today', replace
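File names built from c(current_date) do not sort chronologically. If that matters, one option is to convert the string to a daily date and display it in year-month-day order. This is a sketch; the macro name stamp and the hsb_ prefix are arbitrary choices for illustration.

```stata
* convert c(current_date), e.g. "11 Oct 2011", to a daily date shown as YYYY-MM-DD
local stamp : display %tdCCYY-NN-DD date(c(current_date), "DMY")
display "hsb_`stamp'"
```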
Here is an example of the return list (abbr: ret lis) following the summarize command.
summarize write, detail

                        writing score
-------------------------------------------------------------
      Percentiles      Smallest
 1%           31             31
 5%         35.5             31
10%           39             31       Obs                 200
25%         45.5             31       Sum of Wgt.         200

50%           54                      Mean             52.775
                        Largest       Std. Dev.      9.478586
75%           60             67
90%           65             67       Variance       89.84359
95%           65             67       Skewness      -.4784158
99%           67             67       Kurtosis       2.238527

ret lis

scalars:
                  r(N) =  200
              r(sum_w) =  200
               r(mean) =  52.775
                r(Var) =  89.84359296482411
                 r(sd) =  9.47858602138653
           r(skewness) =  -.4784157665394925
           r(kurtosis) =  2.238527050562138
                r(sum) =  10555
                r(min) =  31
                r(max) =  67
                 r(p1) =  31
                 r(p5) =  35.5
                r(p10) =  39
                r(p25) =  45.5
                r(p50) =  54
                r(p75) =  60
                r(p90) =  65
                r(p95) =  65
                r(p99) =  67
We can use this information to compute a statistic, such as the coefficient of variation, that summarize does not display.
display as txt "Coefficient of variation = " as res r(sd)/abs(r(mean))*100
Coefficient of variation = 17.960371
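The r() results are replaced by the next command that returns results, so values needed later are usually copied into macros first. The sketch below illustrates this with the auto dataset that ships with Stata; the dataset and macro names are assumptions for illustration.

```stata
sysuse auto, clear

quietly summarize price
local mean_price = r(mean)     // copy before r() is overwritten
local sd_price   = r(sd)

quietly summarize mpg          // r(mean) now refers to mpg, not price

display "CV of price = " %6.2f 100 * `sd_price' / abs(`mean_price')
```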
The ereturn list (abbr: eret lis) is used following estimation commands such as regress, anova, logit, sem, etc. Here is an example following regress.
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
regress write read female
[output omitted]

eret lis

scalars:
                  e(N) =  200
               e(df_m) =  2
               e(df_r) =  197
                  e(F) =  77.21062421518373
                 e(r2) =  .439419213038751
               e(rmse) =  7.132734938503835
                e(mss) =  7856.321182518197
                e(rss) =  10022.5538174818
               e(r2_a) =  .4337280375366064
                 e(ll) =  -675.2152914029984
               e(ll_0) =  -733.0934827146214
               e(rank) =  3

macros:
            e(cmdline) : "regress write read female"
              e(title) : "Linear regression"
          e(marginsok) : "XB default"
                e(vce) : "ols"
             e(depvar) : "write"
                e(cmd) : "regress"
         e(properties) : "b V"
            e(predict) : "regres_p"
              e(model) : "ols"
          e(estat_cmd) : "regress_estat"

matrices:
                  e(b) :  1 x 3
                  e(V) :  3 x 3

functions:
             e(sample)
The matrix e(b) contains the parameter estimates while e(V) has the covariance matrix of the parameter estimates.
mat list e(b)

e(b)[1,3]
         read     female      _cons
y1  .56588693   5.486894  20.228368

mat list e(V)

symmetric e(V)[3,3]
              read      female       _cons
  read   .00243887
female   .00265893   1.0287262
 _cons  -.12883112  -.69953157   7.3644737
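To use individual elements of e(b) or e(V) in a calculation, it is common to copy them into ordinary Stata matrices first. The following sketch uses the auto dataset that ships with Stata rather than hsbdemo, which is an assumption made so the example runs without a network connection.

```stata
sysuse auto, clear
quietly regress price mpg weight

matrix b = e(b)                  // copy the coefficient row vector
matrix V = e(V)                  // copy the variance-covariance matrix

* the standard error of mpg is the square root of its diagonal element of V
display "se(mpg) = " sqrt(V[1,1]) "   _se[mpg] = " _se[mpg]

* elements can also be located by column name
display "b[mpg] = " b[1, colnumb(b, "mpg")]
```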
You can also access the coefficients and standard errors using _b[varname] and _se[varname]. For example, if you wanted the predicted score for a female with a reading score of 60, you could type the following.
display _b[_cons] + _b[female]*1 + _b[read]*60
59.668478
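The stored results can be combined in the same way to rebuild statistics by hand. As a sketch, again using the auto dataset that ships with Stata (an assumption), the t statistic and the adjusted R-squared can be recomputed from _b, _se and the e() scalars.

```stata
sysuse auto, clear
quietly regress price mpg weight

* t statistic for mpg built from the stored coefficient and standard error
display "t for mpg = " _b[mpg] / _se[mpg]

* adjusted R-squared rebuilt from e() scalars; it should match e(r2_a)
scalar r2_adj = 1 - (1 - e(r2)) * (e(N) - 1) / e(df_r)
display "adjusted R2 = " r2_adj "   e(r2_a) = " e(r2_a)
```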
Macrobiotics
Macro variables are a good way to store values for later use. Stata supports two kinds of macro variables: 1) global macros and 2) local macros. Global macros persist until Stata is shut down or the macros are cleared, while local macros exist only while the do-file or ado-file that defined them is running. Then they disappear. Macros have a name and a way to refer to the values stored in them. For global macros, $name refers to the value of the macro called name. For local macros, `name' refers to the value of the macro called name (watch out for the two kinds of quotation marks: a left single quote, or backtick, to open and an apostrophe to close).
Say I wanted to compute the difference in medians for two groups. Here is one way you could do this using local macros.
quietly sum write if female==0, detail
local mmedian = r(p50)
quietly sum write if female==1, detail
local fmedian = r(p50)
display "median difference = " `fmedian' - `mmedian'
median difference = 5

macro lis
[output omitted]
Here is the same thing using global macros.
quietly sum write if female==0, detail
global mmedian = r(p50)
quietly sum write if female==1, detail
global fmedian = r(p50)
display "median difference = " $fmedian - $mmedian
median difference = 5

macro list
fmedian:        57
mmedian:        52
[output omitted]
This only scratches the surface of the utility of macro variables. We will see other examples as we go along.
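Macros can hold text as well as numbers, which makes them handy for lists of variable names. The sketch below uses the auto dataset that ships with Stata; the macro names controls and outcome are made up for illustration.

```stata
sysuse auto, clear

* a local macro holding a list of variable names
local controls "weight length"
regress price mpg `controls'

* a global macro is referenced with $ and persists after a do-file ends
global outcome "price"
summarize $outcome

macro drop outcome            // remove the global when it is no longer needed
```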
Feeling Loopy
Many programming languages support looping. Stata has several ways of doing loops: foreach, forvalues and while. We don’t have the time to demonstrate all of them, so we will show you two examples of looping across variables.
For the first example we want to create centered variables and squared centered variables for five variables.
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
foreach var of varlist read write math science socst {
    quietly sum `var'
    gen c`var' = `var' - r(mean)    // create centered variable
    gen c`var'2 = c`var'^2          // create squared centered variable
}
For our second example we want to create a long dataset in which our variables are stacked on top of one another.
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
list id read write math science socst in 1/12, sep(0)

     +---------------------------------------------+
     |  id   read   write   math   science   socst |
     |---------------------------------------------|
  1. |  45     34      35     41        29      26 |
  2. | 108     34      33     41        36      36 |
  3. |  15     39      39     44        26      42 |
  4. |  67     37      37     42        33      32 |
  5. | 153     39      31     40        39      51 |
  6. |  51     42      36     42        31      39 |
  7. | 164     31      36     46        39      46 |
  8. | 133     50      31     40        34      31 |
  9. |   2     39      41     33        42      41 |
 10. |  53     34      37     46        39      31 |
 11. |   1     34      44     40        39      41 |
 12. | 128     39      33     38        47      41 |
     +---------------------------------------------+

local i = 1
foreach var of varlist read write math science socst {
    rename `var' y`i'
    local i = `i' + 1
}
list id y* in 1/12, sep(0)

     +------------------------------+
     |  id   y1   y2   y3   y4   y5 |
     |------------------------------|
  1. |  45   34   35   41   29   26 |
  2. | 108   34   33   41   36   36 |
  3. |  15   39   39   44   26   42 |
  4. |  67   37   37   42   33   32 |
  5. | 153   39   31   40   39   51 |
  6. |  51   42   36   42   31   39 |
  7. | 164   31   36   46   39   46 |
  8. | 133   50   31   40   34   31 |
  9. |   2   39   41   33   42   41 |
 10. |  53   34   37   46   39   31 |
 11. |   1   34   44   40   39   41 |
 12. | 128   39   33   38   47   41 |
     +------------------------------+

reshape long y, i(id) j(var)
lis id y var in 1/12, sep(0)

     +---------------+
     | id    y   var |
     |---------------|
  1. |  1   34     1 |
  2. |  1   44     2 |
  3. |  1   40     3 |
  4. |  1   39     4 |
  5. |  1   41     5 |
  6. |  2   39     1 |
  7. |  2   41     2 |
  8. |  2   33     3 |
  9. |  2   42     4 |
 10. |  2   41     5 |
 11. |  3   63     1 |
 12. |  3   65     2 |
     +---------------+
It is possible to loop through observations (rows), but it is usually not necessary. Here is an example in which we create an index value that is equal to the observation number.
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
local bign = _N
generate index = .
forvalues i=1/`bign' {
    quietly replace index=`i' in `i'
}
list index id in 1/10

     +-------------+
     | index    id |
     |-------------|
  1. |     1    45 |
  2. |     2   108 |
  3. |     3    15 |
  4. |     4    67 |
  5. |     5   153 |
     |-------------|
  6. |     6    51 |
  7. |     7   164 |
  8. |     8   133 |
  9. |     9     2 |
 10. |    10    53 |
     +-------------+
Here is a much simpler way to get the same result.
drop index
generate index = _n
There is no end to the uses for looping. Unfortunately, we don’t have time to cover more today.
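For reference, the other two loop constructs mentioned at the start of this section look like this. The sketch is self-contained and does not need a dataset; the macro names are arbitrary.

```stata
* forvalues steps through a numeric range; here 10, 20, 30
forvalues k = 10(10)30 {
    display "k = `k'"
}

* while repeats until its condition is false
local i = 1
local total = 0
while `i' <= 5 {
    local total = `total' + `i'
    local i = `i' + 1
}
display "sum of 1 to 5 = `total'"
```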
Somewhat Iffy
Every good programming language has the ability to conditionally execute commands. Stata is no exception. It has both if and else to allow you to control the execution of commands. We will illustrate the use of if with an example that runs a series of t-tests and reports which ones are statistically significant.
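foreach var of varlist read write math science {
    quietly ttest `var', by(female)
    // the cutoff and the display lines were lost from the original listing;
    // .05 is an assumed significance level
    if r(p)<.05 {
        display "variable `var' t = " r(t) "* statistically significant"
    }
    else {
        display "variable `var' t = " r(t) " not statistically significant"
    }
}
variable read t = .74801096 not statistically significant
variable write t = -3.7340739* statistically significant
variable math t = .41299865 not statistically significant
variable science t = 1.8123753 not statistically significant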
Let’s revisit the issue of looping through observations (rows) with an added if statement. Let’s say that we want a list of all the females with reading scores less than 50. Here is one way.
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
display "Females with a reading score less than 50"
local bign = _N
forvalues i=1/`bign' {
    if female[`i']==1 & read[`i']<50 {
        display id[`i'] " " read[`i']
    }
}
45 34
51 42
2 39
1 34
106 36
89 35
19 28
...
[output omitted]
Here is a much better (simpler and faster) way to get the same results by subsetting with if as part of the list command.
list id read if female==1 & read<50
[output omitted]
Most of Stata’s commands work with the if and in qualifiers.
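The if qualifier and the if command look alike but behave differently: the qualifier is evaluated for every observation, while the command is evaluated once. A short sketch of the difference, using the auto dataset that ships with Stata (an assumption for illustration):

```stata
sysuse auto, clear

* if qualifier: evaluated for every observation
count if foreign == 1

* if command: evaluated once; a bare variable name means its value in observation 1
if foreign == 1 {
    display "the first car is foreign"
}
else {
    display "the first car is domestic"
}
```

This is why the loop over observations above had to write female[`i'] and read[`i'] explicitly.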
Give’em the Boot
Bootstrapping is an alternative method for determining standard errors. For standard estimation commands, bootstrapping is just a matter of using the vce(boot) option.
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
regress write read female, vce(boot, reps(100))
(running regress on estimation sample)

Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100

Linear regression                               Number of obs      =       200
                                                Replications       =       100
                                                Wald chi2(2)       =    215.00
                                                Prob > chi2        =    0.0000
                                                R-squared          =    0.4394
                                                Adj R-squared      =    0.4337
                                                Root MSE           =    7.1327

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
       write |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .5658869   .0428593    13.20   0.000     .4818842    .6498897
      female |   5.486894   .9479372     5.79   0.000     3.628971    7.344817
       _cons |   20.22837   2.423242     8.35   0.000      15.4789    24.97784
------------------------------------------------------------------------------
But what if we want bootstrap standard errors for a statistic that Stata does not compute? We will need to write our own ado program. Let’s go back to the example of the difference in medians between males and females. We will write a program named med_diff, which we will save in a file named med_diff.ado. Here’s how we do this.
/* begin med_diff program */
program med_diff, rclass
    quietly sum write if female==0, detail
    local mmedian = r(p50)
    quietly sum write if female==1, detail
    local fmedian = r(p50)
    return scalar meddif = `fmedian' - `mmedian'
    display as txt "median difference = " `fmedian' - `mmedian'
end
/* end med_diff program */
Let’s try it.
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
med_diff
median difference = 5

ret lis

scalars:
             r(meddif) =  5
Now, let’s use med_diff with the bootstrap command.
bootstrap r(meddif), reps(100): med_diff
(running med_diff on estimation sample)

Warning:  Because med_diff is not an estimation command or does not set
          e(sample), bootstrap has no way to determine which observations are
          used in calculating the statistics and so assumes that all
          observations are used.  This means that no observations will be
          excluded from the resampling because of missing values or other
          reasons.

          If the assumption is not true, press Break, save the data, and drop
          the observations that are to be excluded.  Be sure that the dataset
          in memory contains only the relevant data.

Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100

Bootstrap results                               Number of obs      =       200
                                                Replications       =       100

      command:  med_diff
        _bs_1:  r(meddif)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |          5   2.844435     1.76   0.079    -.5749893    10.57499
------------------------------------------------------------------------------

estat boot, percentile

Bootstrap results                               Number of obs      =       200
                                                Replications       =       100

      command:  med_diff
        _bs_1:  r(meddif)

------------------------------------------------------------------------------
             |    Observed               Bootstrap
             |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |           5        .99   2.8444346            0          12   (P)
------------------------------------------------------------------------------
(P)    percentile confidence interval
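The med_diff program is hard-wired to write and female. A possible generalization, sketched below, uses the syntax command to accept any outcome variable and any 0/1 grouping variable. The program name med_diff2, the seed value, and the use of the auto dataset that ships with Stata are all assumptions for illustration, not part of the seminar files.

```stata
capture program drop med_diff2
program define med_diff2, rclass
    // hypothetical generalization: one outcome variable plus a 0/1 grouping variable
    syntax varname, by(varname)
    quietly summarize `varlist' if `by' == 0, detail
    local med0 = r(p50)
    quietly summarize `varlist' if `by' == 1, detail
    local med1 = r(p50)
    return scalar meddif = `med1' - `med0'
end

sysuse auto, clear
set seed 20111011              // arbitrary seed so the replications can be reproduced
bootstrap r(meddif), reps(100): med_diff2 mpg, by(foreign)
```

Setting the seed before bootstrap means that rerunning the do-file gives the same standard error.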
Talkin’ ‘Bout My Egeneration
Egen commands can make your life as a programmer much easier by saving you from additional programming. Here is an example that creates a variable containing group means. First, the egen way.
use https://stats.idre.ucla.edu/stat/data/hsbmar, clear
egen fmean = mean(write), by(female)
And next, how you might program the same thing yourself.
drop fmean
quietly sum write if female==0, meanonly
generate fmean = r(mean) if female==0
quietly sum write if female==1, meanonly
replace fmean = r(mean) if female==1
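The hand-written version needs one pair of lines per group. When a grouping variable has many levels, one way to avoid the repetition is to loop over the levels with levelsof. This sketch uses the auto dataset that ships with Stata; the variable name gmean is made up for illustration.

```stata
sysuse auto, clear

generate gmean = .
levelsof rep78, local(levels)          // nonmissing values of rep78
foreach lev of local levels {
    quietly summarize mpg if rep78 == `lev', meanonly
    quietly replace gmean = r(mean) if rep78 == `lev'
}
```

Even so, the single egen line remains shorter and harder to get wrong.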
Here is another example, in which we create a new variable which is the average of the non-missing values of four variables.
egen amean = rowmean(read write math science)
sum amean read write math science
And here is one way to program this. Note that we need to loop over the observations (rows).
drop amean
local bign = _N
generate amean=.
forvalues i=1/`bign' {
    local sum = 0
    local n = 0
    foreach var of varlist read write math science {
        // add only the non-missing values
        if `var'[`i']~=. {
            local sum = `sum' + `var'[`i']
            local n = `n' + 1
        }
    }
    // store the mean once all four variables have been examined
    if `n'~=0 {
        quietly replace amean = `sum'/`n' in `i'
    }
}
These are just two of the many egen commands available to programmers.
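A few more egen functions that often replace hand-written loops are sketched below, using the auto dataset that ships with Stata. The new variable names are arbitrary.

```stata
sysuse auto, clear

* group-level statistic repeated on every row of the group
bysort foreign: egen mean_mpg = mean(mpg)

* number of missing values across a set of variables, row by row
egen n_miss = rowmiss(price mpg rep78)

* standardized (mean 0, sd 1) version of a variable
egen z_price = std(price)

summarize mean_mpg n_miss z_price
```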
The Matrix Reloaded
There are two matrix systems in Stata: 1) traditional Stata matrix commands and 2) Mata, which is a full programming language in addition to a matrix language. Many of the Stata estimation commands are written in Mata. Here are two examples of doing regression using matrices. We will use the basic regression formula shown below with each method.
b = (X'X)^{-1} X'y
We want to regress write on female and read. Hopefully the results will look something like this:
regress write female read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------
To begin we need to create a constant for the intercept, which we will call cons. Now, let’s try this with the Stata matrix commands.
generate cons = 1
mkmat female read cons, mat(X)
mkmat write, mat(y)
matrix b = syminv(X'*X)*X'*y
matrix list b

b[3,1]
            write
female   5.486894
  read  .56588693
  cons  20.228368
In Mata the code would look like this.
tomata female write read cons
mata
y = write
X = (female, read, cons)
b = invsym(X'X)*X'y
b
end

                 1
    +---------------+
  1 |  5.486893967  |
  2 |  .5658869298  |
  3 |  20.22836845  |
    +---------------+
These examples demonstrate how matrix commands may be used. Stata’s two matrix systems are powerful tools that can do both simple tasks and allow you to program complex estimation procedures.
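The same matrix algebra extends to standard errors. The sketch below reads the data with Mata's built-in st_data() function instead of tomata and uses the auto dataset that ships with Stata; both choices, along with the Mata variable names, are assumptions made so the example is self-contained.

```stata
sysuse auto, clear

mata:
// read variables straight from the dataset in memory
y = st_data(., "price")
X = st_data(., ("mpg", "weight"))
X = X, J(rows(X), 1, 1)                     // append a column of ones for the intercept

XXinv = invsym(X'X)
b     = XXinv * X'y                         // coefficient vector
resid = y - X*b                             // residuals
s2    = (resid'resid) / (rows(X) - cols(X)) // residual variance
se    = sqrt(diagonal(s2 * XXinv))          // standard errors

(b, se)                                     // coefficients and standard errors side by side
end

regress price mpg weight                    // for comparison
```

The two columns printed by Mata should agree with the Coef. and Std. Err. columns from regress.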
Conclusion
Spending some time learning to program in Stata can increase your productivity and efficiency in working with your research data.
updated 10/11/11