Many researchers use Stata without ever writing a program even though programming could make them more efficient in their data analysis projects. Stata programming is not difficult since it mainly involves the use of Stata commands to you already use. The trick to Stata programming is to use the appropriate commands in the right sequence. Of course, this is the trick to any kind of programming.
There are two kinds of files that are used in Stata programming, do-files and ado-files. Do-files are run from the command line using the do command, for example,
- do hsbcheck hsberr
Ado-files, on the other hand, work like ordinary Stata commands by just using the file name in the command line, for example,
- median3 read write math science
In fact, many of the built-in Stata commands are just ado-files. You can look at the source code for the ado commands using the viewsource command, for example,
- viewsource regress.ado
Do-files can be placed in the same folder as the data but ado-files need to go where Stata can find them. The best place for user written ado-files is in the /ado/personal/ directory. The location of this directory can vary for system to system.
We will try to give users a feel for Stata programming by covering the following topics:
- Creating and using do-files for checking and cleaning data.
- Using do-files for analyzing data.
- Programming tools and tips.
- Writing an ado program to create a statistical command.
- Creating an ado-file that uses Mata, Stata’s matrix programming language.
Part 1: Creating and using do-files for checking and cleaning data
We will create a file, hsbcheck.do, that contains commands that will display observations with incorrect or impossible values.
/* begin hsbcheck.do */ version 9.2 clear use `1' set more off nmissing describe summarize list id if id<1 | id>200 list id gender if ~ inlist(gender,1,2,.) list id race if ~ inlist(race,1,2,3,4,.) list id ses if ~ inlist(ses,1,2,3,.) list id schtyp if ~ inlist(schtyp,1,2,.) list id prog if ~ inlist(prog,1,2,3,.) list id read if (read<1 | read>99) & read~=. list id write if (write<1 | write>99) & write~=. list id math if (math<1 | math>99) & math~=. list id science if (science<1 | science>99) & science~=. list id socst if (socst<1 | socst>99) & socst~=. list id female if (female<0 | female>1) & female~=. list id region if (region<1 | region>3) & region~=. set more on /* end hsbcheck.do */
Here is how our do-file is used with the dataset hsberr.
do hsbcheck hsberr [output omitted]
So how did the hsbcheck program “know” which file to use? This was done using a macro variable, in this case, `1′, which takes the first term typed after the name fof the program and treats it as as file name. Macro variables have many uses including as variable names or numeric values. We will see additional uses of macro variables in other programs.
Now that we know what errors there are in the data we can write a do-file that will fix the errors. When we know the correct value of an observation, we will replace the incorrect value with the correct one. When we do not know the correct value for an observation, we will replace the incorrect value with missing. The do-file hsbfix.do will read in hsberr, correct the errors and save the corrected file as hsbclean. Here is what hsbfix.do looks like.
/* begin hsbfix.do */ version 9.2 clear use hsberr replace id=193 if id==1193 replace read=47 if read==147 replace science=61 if science==-61 replace gender=. if gender==5 replace race=. if race<1 | race>4 replace ses=. if ses<1 | ses>3 replace schtyp=. if schtyp<1 | schtyp>2 replace prog=. if prog<1 | prog>3 tab gender tab reg /* create female from gender */ generate female = gender recode female 1=0 2=1 label define fem 0 "male" 1 "female" label value female fem tab female gender /* create numeric region from string reg */ encode reg, gen(region) recode region (1/3=1)(7=1)(4/5=2)(6=3) label define region 1 "Los Angeles" 2 "Orange" 3 "Riverside", modify tab reg region label data "hsb clean data using hsberr.do" save hsbclean, replace /* end hsbfix.do */
Notes on hsbcheck.do:
1) One important thing to note is that after we fix the incorrect values, we will save the data file with a new name. We will never change any of the values in the original data file, hsberr.
2) We created two varibles, female and region that did not exist ins the originaldataset hsberr. Gender is an ambiguous variable name while female is very straightforward and hardly needs value labels. The variable reg has many problems common to string variables. The region “Los Angeles” was coded at least four different ways. The solution used here (one of many possible solutions) was to convert the variable into a numeric variable, recode observations into the correct regions and to create value labels for the regions.
First, we will run hsbfix on the original file hsberr then, as a check, we will run hsbcheck on the new file hsbclean.
do hsbfix do hsbcheck hsbclean [output omitted]
Part 2: Using do-files for analyzing data
Next, we will create a do-file that contains all of the commands that we need to run our data analysis. This do-file will be called hsbanalyze.do.
/* begin hsbanalyze.do */ verson 9.2 log using hsb10_14_08.txt, text replace summarize read write math science univar read write math science tabstat write, stat(n mean sd p25 p50 p75) by(female) tabstat write, stat(n mean sd p25 p50 p75) by(prog) ttest write, by(gender) histogram write, normal start(30) width(5) more kdensity write, normal more tab1 female ses prog tab prog ses, all correlate write read science female more regress write read female rvfplot more ologit ses read write female gologit2 ses read write female, autofit mlogit prog read write female log close /* end hsbanalyze */
Now, let’s use hsbanalyze with our data file hsbclean.
use hsbclean, clear do hsbanalyze [output omitted]
This may not seem all that useful; after all, you could just as easily type each of the commands into the command window, but what if your coauthor comes to you and says, “we need to redo the whole analysis using only schtyp equal to one.” Here’s all you have to do.
keep if schtyp==1 do hsbanalyze [output omitted]
Part 3: Writing an ado program to create a statistical command
Now, let’s try our hand at writing a statistical command. Ado programs are very much like do-file programs with the advantage that you just have to type the name of the command. You will need to include two additional commands to create an ado program. You begin with program define and the name of the command and you end with an end command. Also, you need to save the file as a .ado using the same name for the file as the name of the new command.
We will illustrate the ado program by writing a command that computes the median. Of course, Stata already have commands that compute medians but we are doing this to illustrate the process of creating an ado program.
The basic logic of computing the median is to sort the variable of interest then, if there are an odd number of values take the middle one, an if there are an even number of values take half the distance between the middle two. Below is the first version of our program which is saved in the file names median1.ado.
/* Median Program -- Version #1 */ /* basic program with one variable */ program define median1 version 9.2 sort `1' quietly count if `1' ~= . local n = r(N) local mid = int(`n'/2) local odd = mod(`n',2) if `odd' { local median = `1'[`mid'+1] } else { local median = (`1'[`mid'] + `1'[`mid'+1])/2 } display display as text "Median of `1' = " as result `median' end /* end median1.ado */
Let’s try median1 on the hsbclean dataset.
use hsbclean median1 write Median of write = 54
This program worked just fine but it could be improved. We will modify the program to improve the output format and to allow for multiple variables. We will call this new program median2 which will be saved in the file median2.ado.
/* Median Program -- Version #2 */ /* multiple variables; saves results in return list */ program define median2, rclass version 9.2 syntax varlist(numeric) preserve /* preserve data */ display display as text " Variable N Median" display as text "----------------------------" foreach var of local varlist { quietly count if `var' ~= . local n = r(N) local mid = int(`n'/2) local odd = mod(`n',2) sort `var' if `odd' { local median = `var'[`mid'+1] } else { local median = (`var'[`mid'] + `var'[`mid'+1]) / 2 } display as result %9s "`var'" %9.0f `n' %10.2f `median' } return scalar Mdn = `median' return scalar N = `n' restore /* restore data */ end /* end median2.ado */
Here is how median2 works.
median2 read write math science Variable N Median ---------------------------- read 200 50.00 write 200 54.00 math 200 52.00 science 195 53.00
Again, everything seems to be working fine but it would be better if the program allowed the use of if or in to subset the data. For example, what if we wanted the medians just for males. The program median3 will allow the user to use if and in.
/* Median Program -- Version #3 */ /* Allows for if and in */ program define median3, rclass version 9.2 syntax varlist(numeric) [if] [in] display display as text " Variable N Median" display as text "----------------------------" foreach var of local varlist { preserve if "`if'"~="" | "`in'"~="" { quietly keep `if' `in' } quietly keep if ~missing(`var') quietly count local n = r(N) local mid = int(`n'/2) local odd = mod(`n',2) sort `var' if `odd' { local median = `var'[`mid'+1] } else { local median = (`var'[`mid'] + `var'[`mid'+1])/2 } display as result %9s "`var'" %9.0f `n' %10.2f `median' restore } return scalar Mdn = `median' return scalar N = `n' end /* end median3.ado */
Here is how this version of the program works.
median3 read write math science if female==0 Variable N Median ---------------------------- read 90 52.00 write 90 50.50 math 90 51.50 science 85 55.00 median3 read write math science in 50/150 Variable N Median ---------------------------- read 101 50.00 write 101 54.00 math 101 52.00 science 98 53.00
The rclass option in median2 and median3 allow you to temporarily store the results from a program. For our program, we keep the frequency and median for the last variable used in the command. Here is how you can view the values being stored.
return list scalars: r(N) = 99 r(Mdn) = 50
Stata has included some tools to make creating and debugging program as bit easier. One of them is the trace option that shows you command by command what is happening inside your program. Let’s run median3 with trace turned on and see what it looks like.
set trace on median3 science if female==0 [output omitted] set trace off
Part 4: Creating an ado-file that uses the Stata matrix operations for performing an analysis
Many statistical procedures are done more easily using matrix commands. Stata has two different ways that you you can use matrix commands in your programs. First, Stata has a fairly complete set of matrix commands build right into Stata itself. Second, Stata has a complete matrix programming language called Mata. Mata is faster and more powerful than Stata’s built-in matrix commands but the built-in commands are easier to program for small to medium programming projects. We will illustrate Stata’s built-in matrix commands by writing an OLS regression program.
We will begin with matreg1.ado that makes use of the famous matrix equation for the regression coefficients, b=(X’X)-1X’Y.
/* begin matreg1.ado */ program define matreg1 version 9.0 syntax varlist(min=2 numeric) [if] [in] [, Level(integer $S_level)] marksample touse /* mark cases in the sample */ tokenize "`varlist'" quietly matrix accum sscp = `varlist' if `touse' matrix XX = sscp[2...,2...] /* X'X */ matrix Xy = sscp[1,2...] /* X'y */ matrix b = Xy * syminv(XX) /* (X'X)-1X'y */ local k = colsof(b) /* number of coefs */ local nobs = r(N) local df = `nobs' - (rowsof(sscp) - 1) /* df residual */ matrix hat = Xy * b' matrix V = syminv(XX) * (sscp[1,1] - hat[1,1])/`df' matrix C = corr(V) matrix seb = vecdiag(V) matrix seb = seb[1, 1...] matrix t = J(1,`k',0) matrix p = t local i = 1 while `i' <= `k' { matrix seb[1,`i'] = sqrt(seb[1,`i']) matrix t[1,`i'] = b[1,`i']/seb[1,`i'] matrix p[1,`i'] = tprob(`df',t[1,`i']) local i = `i' + 1 } display display "Dependent variable: `1'" display display "Regression coefficients" matrix list b display display "Standard error of coefficients" matrix list seb display display "Values of t" matrix list t display display "P values for t" matrix list p display display "Covariance of the regression coefficients" matrix list V display display "Correlation of the regression coefficients" matrix list C matrix drop sscp XX Xy b hat V seb t p C end /* end matreg1.ado */
Here is an example of how to use matreg1.
use https://stats.idre.ucla.edu/stat/data/hsb2, clear regress write read female Source | SS df MS Number of obs = 200 -------------+------------------------------ F( 2, 197) = 77.21 Model | 7856.32118 2 3928.16059 Prob > F = 0.0000 Residual | 10022.5538 197 50.8759077 R-squared = 0.4394 -------------+------------------------------ Adj R-squared = 0.4337 Total | 17878.875 199 89.843593 Root MSE = 7.1327 ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- read | .5658869 .0493849 11.46 0.000 .468496 .6632778 female | 5.486894 1.014261 5.41 0.000 3.48669 7.487098 _cons | 20.22837 2.713756 7.45 0.000 14.87663 25.58011 ------------------------------------------------------------------------------ matreg1 write read female Dependent variable: write Regression coefficients b[1,3] read female _cons write .56588693 5.486894 20.228368 Standard error of coefficients seb[1,3] read female _cons r1 .04938488 1.0142614 2.7137564 Values of t t[1,3] c1 c2 c3 r1 11.458708 5.4097434 7.4540105 P values for t p[1,3] c1 c2 c3 r1 1.265e-23 1.818e-07 2.800e-12 Covariance of the regression coefficients symmetric V[3,3] read female _cons read .00243887 female .00265893 1.0287262 _cons -.12883112 -.69953157 7.3644737 Correlation of the regression coefficients symmetric C[3,3] read female _cons read 1 female .05308387 1 _cons -.96129327 -.25414792 1
In matreg1 we computed t-tests and p-values manually. We also managed all of the output display. We can let Stata do all of that for us by using the estimates post command which will greatly simplify the program. We will do this in matreg2 as shown below.
/* begin matreg2.ado */ program define matreg2, eclass version 9.0 syntax varlist(min=2 numeric) [if] [in] [, Level(integer $S_level)] marksample touse /* mark cases in the sample */ tokenize "`varlist'" quietly matrix accum sscp = `varlist' if `touse' matrix XX = sscp[2...,2...] /* X'X */ matrix Xy = sscp[1,2...] /* X'y */ matrix b = Xy * syminv(XX) /* (X'X)-1X'y */ local k = colsof(b) /* number of coefs */ local nobs = r(N) local df = `nobs' - (rowsof(sscp) - 1) /* df residual */ matrix hat = Xy * b' matrix V = syminv(XX) * (sscp[1,1] - hat[1,1])/`df' ereturn post b V, dof(`df') obs(`nobs') depname(`1') /// esample(`touse') ereturn local depvar "`1'" ereturn local cmd "matreg" display ereturn display, level(`level') matrix drop sscp XX Xy hat end /* end matreg2.ado */
Now, check out matreg2.
matreg2 write read female ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- read | .5658869 .0493849 11.46 0.000 .468496 .6632778 female | 5.486894 1.014261 5.41 0.000 3.48669 7.487098 _cons | 20.22837 2.713756 7.45 0.000 14.87663 25.58011 ------------------------------------------------------------------------------
The eclass option does for estimation commands what the rclass option does for regular statistical commands. To view the eclass values we need to use the ereturn list command>
ereturn list scalars: e(N) = 200 e(df_m) = 2 e(df_r) = 197 e(F) = 77.21062421518363 e(r2) = .4394192130387506 e(rmse) = 7.132734938503835 e(mss) = 7856.321182518186 e(rss) = 10022.5538174818 e(r2_a) = .4337280375366059 e(ll) = -675.2152914029984 e(ll_0) = -733.0934827146212 macros: e(cmdline) : "regress write read female, nohead" e(title) : "Linear regression" e(vce) : "ols" e(depvar) : "write" e(cmd) : "regress" e(properties) : "b V" e(predict) : "regres_p" e(model) : "ols" e(estat_cmd) : "regress_estat" matrices: e(b) : 1 x 3 e(V) : 3 x 3 functions: e(sample) matrix list e(b) e(b)[1,3] read female _cons y1 .56588693 5.486894 20.228368 matrix list e(V) symmetric e(V)[3,3] read female _cons read .00243887 female .00265893 1.0287262 _cons -.12883112 -.69953157 7.3644737
Conclusion
Spending a some time learning to program in Stata can increase your productivity and efficiency in your data analysis projects.