Statistical Computing Seminars Introduction to Stata Programming

Many researchers use Stata without ever writing a program even though programming could make them more efficient in their data analysis projects. Stata programming is not difficult since it mainly involves the use of Stata commands to you already use. The trick to Stata programming is to use the appropriate commands in the right sequence. Of course, this is the trick to any kind of programming.

There are two kinds of files that are used in Stata programming, do-files and ado-files. Do-files are run from the command line using the do command, for example,

do hsbcheck hsberr

Ado-files, on the other hand, work like ordinary Stata commands by just using the file name in the command line, for example,

median3 read write math science

In fact, many of the built-in Stata commands are just ado-files. You can look at the source code for the ado commands using the viewsource command, for example,

viewsource regress.ado

Do-files can be placed in the same folder as the data but ado-files need to go where Stata can find them. The best place for user written ado-files is in the /ado/personal/ directory. The location of this directory can vary for system to system.

We will try to give users a feel for Stata programming by covering the following topics:

Creating and using do-files for checking and cleaning data.
Using do-files for analyzing data.
Programming tools and tips.
Writing an ado program to create a statistical command.
Creating an ado-file that uses Mata, Stata’s matrix programming language.

Part 1: Creating and using do-files for checking and cleaning data

We will create a file, hsbcheck.do, that contains commands that will display observations with incorrect or impossible values.

/* begin hsbcheck.do */
version 9.2
clear
use `1'
set more off

nmissing
describe
summarize
list id if id<1 | id>200
list id gender if ~ inlist(gender,1,2,.)  
list id race   if ~ inlist(race,1,2,3,4,.)  
list id ses    if ~ inlist(ses,1,2,3,.) 
list id schtyp if ~ inlist(schtyp,1,2,.)   
list id prog   if ~ inlist(prog,1,2,3,.)       
list id read    if (read<1    | read>99)      & read~=.
list id write   if (write<1   | write>99)     & write~=.
list id math    if (math<1    | math>99)      & math~=.
list id science if (science<1 | science>99)   & science~=.
list id socst   if (socst<1   | socst>99)     & socst~=.
list id female  if (female<0  | female>1)     & female~=.
list id region  if (region<1  | region>3)     & region~=.

set more on
/* end hsbcheck.do */

Here is how our do-file is used with the dataset hsberr.

do hsbcheck hsberr

[output omitted]

So how did the hsbcheck program “know” which file to use? This was done using a macro variable, in this case, `1′, which takes the first term typed after the name fof the program and treats it as as file name. Macro variables have many uses including as variable names or numeric values. We will see additional uses of macro variables in other programs.

Now that we know what errors there are in the data we can write a do-file that will fix the errors. When we know the correct value of an observation, we will replace the incorrect value with the correct one. When we do not know the correct value for an observation, we will replace the incorrect value with missing. The do-file hsbfix.do will read in hsberr, correct the errors and save the corrected file as hsbclean. Here is what hsbfix.do looks like.

/* begin hsbfix.do */
version 9.2
clear
use hsberr
replace id=193     if id==1193
  replace read=47    if read==147
  replace science=61 if science==-61
  replace gender=.   if gender==5
  replace race=.     if race<1 | race>4
  replace ses=.      if ses<1 | ses>3
  replace schtyp=.   if schtyp<1 | schtyp>2
  replace prog=.     if prog<1 | prog>3
  
  tab gender
  
  tab reg
  
  /* create female from gender */
  generate female = gender
  recode female 1=0 2=1
  label define fem 0 "male" 1 "female"
  label value female fem
  tab female gender
  
  /* create numeric region from string reg */
  encode reg, gen(region)
  recode region (1/3=1)(7=1)(4/5=2)(6=3)
  label define region 1 "Los Angeles" 2 "Orange" 3 "Riverside", modify
  tab reg region
  
  label data "hsb clean data using hsberr.do"
  save hsbclean, replace
/* end hsbfix.do */

Notes on hsbcheck.do:

1) One important thing to note is that after we fix the incorrect values, we will save the data file with a new name. We will never change any of the values in the original data file, hsberr.

2) We created two varibles, female and region that did not exist ins the originaldataset hsberr. Gender is an ambiguous variable name while female is very straightforward and hardly needs value labels. The variable reg has many problems common to string variables. The region “Los Angeles” was coded at least four different ways. The solution used here (one of many possible solutions) was to convert the variable into a numeric variable, recode observations into the correct regions and to create value labels for the regions.

First, we will run hsbfix on the original file hsberr then, as a check, we will run hsbcheck on the new file hsbclean.

do hsbfix

do hsbcheck hsbclean

[output omitted]

Part 2: Using do-files for analyzing data

Next, we will create a do-file that contains all of the commands that we need to run our data analysis. This do-file will be called hsbanalyze.do.

/* begin hsbanalyze.do */
verson 9.2
log using hsb10_14_08.txt, text replace
summarize read write math science 
univar read write math science 
tabstat write, stat(n mean sd p25 p50 p75) by(female)
tabstat write, stat(n mean sd p25 p50 p75) by(prog)
ttest write, by(gender)
histogram write, normal start(30) width(5)
more
kdensity write, normal
more
tab1 female ses prog
tab prog ses, all 
correlate write read science female
more
regress write read female
rvfplot
more
ologit ses read write female
gologit2 ses read write female, autofit
mlogit prog read write female
log close
/* end hsbanalyze */

Now, let’s use hsbanalyze with our data file hsbclean.

use hsbclean, clear

do hsbanalyze

[output omitted]

This may not seem all that useful; after all, you could just as easily type each of the commands into the command window, but what if your coauthor comes to you and says, “we need to redo the whole analysis using only schtyp equal to one.” Here’s all you have to do.

keep if schtyp==1

do hsbanalyze

[output omitted]

Part 3: Writing an ado program to create a statistical command

Now, let’s try our hand at writing a statistical command. Ado programs are very much like do-file programs with the advantage that you just have to type the name of the command. You will need to include two additional commands to create an ado program. You begin with program define and the name of the command and you end with an end command. Also, you need to save the file as a .ado using the same name for the file as the name of the new command.

We will illustrate the ado program by writing a command that computes the median. Of course, Stata already have commands that compute medians but we are doing this to illustrate the process of creating an ado program.

The basic logic of computing the median is to sort the variable of interest then, if there are an odd number of values take the middle one, an if there are an even number of values take half the distance between the middle two. Below is the first version of our program which is saved in the file names median1.ado.

/* Median Program -- Version #1           */
/* basic program with one variable        */

program define median1
  version 9.2
  sort `1'
  quietly count if `1' ~= .
  local n = r(N)
  local mid = int(`n'/2)
  local odd = mod(`n',2)
  
  if `odd' {
    local median = `1'[`mid'+1]
  }
  else {
    local median = (`1'[`mid'] + `1'[`mid'+1])/2
  }
  
  display
  display as text "Median of `1' = " as result `median'
end 

/* end median1.ado  */

Let’s try median1 on the hsbclean dataset.

use hsbclean

median1 write

Median of write = 54

This program worked just fine but it could be improved. We will modify the program to improve the output format and to allow for multiple variables. We will call this new program median2 which will be saved in the file median2.ado.

/* Median Program -- Version #2                       */
/* multiple variables; saves results in return list   */

program define median2, rclass
  version 9.2
  syntax varlist(numeric)
  
  preserve /* preserve data */
  
  display
  display as text " Variable        N    Median"
  display as text "----------------------------"
  
  foreach var of local varlist {
    quietly count if `var' ~= .
    local n = r(N)
    local mid = int(`n'/2)
    local odd = mod(`n',2)
    sort `var'
    
    if `odd' {
      local median = `var'[`mid'+1]
    }
    else {
      local median = (`var'[`mid'] + `var'[`mid'+1]) / 2
    }
    display as result %9s "`var'" %9.0f `n' %10.2f `median'
  }
  return scalar Mdn = `median'
  return scalar N = `n'
  
  restore /* restore data */
end 

/* end median2.ado  */

Here is how median2 works.

median2 read write math science

 Variable        N    Median
----------------------------
     read      200     50.00
    write      200     54.00
     math      200     52.00
  science      195     53.00

Again, everything seems to be working fine but it would be better if the program allowed the use of if or in to subset the data. For example, what if we wanted the medians just for males. The program median3 will allow the user to use if and in.

/* Median Program -- Version #3    */ 
/* Allows for if and in            */

program define median3, rclass
  version 9.2
  syntax varlist(numeric) [if] [in]        
  
  display
  display as text " Variable        N    Median"
  display as text "----------------------------"
  
  foreach var of local varlist {
    preserve 
    if "`if'"~="" | "`in'"~="" {
      quietly keep `if'  `in' 
    }
    quietly keep if ~missing(`var')
    quietly count 
    local n = r(N)
    local mid = int(`n'/2)
    local odd = mod(`n',2)
    sort `var'
    
    if `odd' {
      local median = `var'[`mid'+1]
    }
    else {
      local median = (`var'[`mid'] + `var'[`mid'+1])/2
    }
    display as result %9s "`var'" %9.0f `n' %10.2f `median'
    restore
  }

  return scalar Mdn = `median'
  return scalar N = `n'
end
/* end median3.ado  */

Here is how this version of the program works.

median3 read write math science if female==0

 Variable        N    Median
----------------------------
     read       90     52.00
    write       90     50.50
     math       90     51.50
  science       85     55.00

median3 read write math science in 50/150

 Variable        N    Median
----------------------------
     read      101     50.00
    write      101     54.00
     math      101     52.00
  science       98     53.00

The rclass option in median2 and median3 allow you to temporarily store the results from a program. For our program, we keep the frequency and median for the last variable used in the command. Here is how you can view the values being stored.

return list

scalars:
                  r(N) =  99
                r(Mdn) =  50

Stata has included some tools to make creating and debugging program as bit easier. One of them is the trace option that shows you command by command what is happening inside your program. Let’s run median3 with trace turned on and see what it looks like.

set trace on

median3 science if female==0

[output omitted]

set trace off

Part 4: Creating an ado-file that uses the Stata matrix operations for performing an analysis

Many statistical procedures are done more easily using matrix commands. Stata has two different ways that you you can use matrix commands in your programs. First, Stata has a fairly complete set of matrix commands build right into Stata itself. Second, Stata has a complete matrix programming language called Mata. Mata is faster and more powerful than Stata’s built-in matrix commands but the built-in commands are easier to program for small to medium programming projects. We will illustrate Stata’s built-in matrix commands by writing an OLS regression program.

We will begin with matreg1.ado that makes use of the famous matrix equation for the regression coefficients, b=(X’X)^-1X’Y.

/*  begin matreg1.ado  */
program define matreg1
  version 9.0
 
    syntax varlist(min=2 numeric) [if] [in] [, Level(integer $S_level)]
    marksample touse                       /* mark cases in the sample */
    tokenize "`varlist'"

    quietly matrix accum sscp = `varlist' if `touse'
    matrix XX = sscp[2...,2...]            /* X'X */
    matrix Xy = sscp[1,2...]               /* X'y */
    matrix b = Xy * syminv(XX)             /* (X'X)-1X'y */
    
    local k = colsof(b)                    /* number of coefs */
    local nobs = r(N)
    local df = `nobs' - (rowsof(sscp) - 1) /* df residual */
    matrix hat = Xy * b'
    matrix V = syminv(XX) * (sscp[1,1] - hat[1,1])/`df'
    matrix C = corr(V)
    matrix seb = vecdiag(V)
    matrix seb = seb[1, 1...]
    matrix t = J(1,`k',0)
    matrix p = t

    local i = 1
    while `i' <= `k' {
      matrix seb[1,`i'] = sqrt(seb[1,`i'])
      matrix t[1,`i'] = b[1,`i']/seb[1,`i']
      matrix p[1,`i'] = tprob(`df',t[1,`i'])
      local i = `i' + 1
    }

  display
  display "Dependent variable: `1'"
  display
  display "Regression coefficients"
  matrix list b
  display
  display "Standard error of coefficients"
  matrix list seb
  display
  display "Values of t"
  matrix list t
  display
  display "P values for t"
  matrix list p
  display
  display "Covariance of the regression coefficients"
  matrix list V
  display
  display "Correlation of the regression coefficients"
  matrix list C

  matrix drop sscp XX Xy b hat V seb t p C 
end 
/*  end matreg1.ado  */

Here is an example of how to use matreg1.

use https://stats.idre.ucla.edu/stat/data/hsb2, clear

regress write read female

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------

matreg1 write read female

Dependent variable: write

Regression coefficients

b[1,3]
            read     female      _cons
write  .56588693   5.486894  20.228368

Standard error of coefficients

seb[1,3]
         read     female      _cons
r1  .04938488  1.0142614  2.7137564

Values of t

t[1,3]
           c1         c2         c3
r1  11.458708  5.4097434  7.4540105

P values for t

p[1,3]
           c1         c2         c3
r1  1.265e-23  1.818e-07  2.800e-12

Covariance of the regression coefficients

symmetric V[3,3]
              read      female       _cons
  read   .00243887
female   .00265893   1.0287262
 _cons  -.12883112  -.69953157   7.3644737

Correlation of the regression coefficients

symmetric C[3,3]
              read      female       _cons
  read           1
female   .05308387           1
 _cons  -.96129327  -.25414792           1

In matreg1 we computed t-tests and p-values manually. We also managed all of the output display. We can let Stata do all of that for us by using the estimates post command which will greatly simplify the program. We will do this in matreg2 as shown below.

/*  begin matreg2.ado  */
program define matreg2, eclass
  version 9.0
 
    syntax varlist(min=2 numeric) [if] [in] [, Level(integer $S_level)]
    marksample touse                       /* mark cases in the sample */
    tokenize "`varlist'"

    quietly matrix accum sscp = `varlist' if `touse'
    matrix XX = sscp[2...,2...]            /* X'X */
    matrix Xy = sscp[1,2...]               /* X'y */
    matrix b = Xy * syminv(XX)             /* (X'X)-1X'y */
    
    local k = colsof(b)                    /* number of coefs */
    local nobs = r(N)
    local df = `nobs' - (rowsof(sscp) - 1) /* df residual */
    matrix hat = Xy * b'
    matrix V = syminv(XX) * (sscp[1,1] - hat[1,1])/`df'
   
    ereturn post b V, dof(`df') obs(`nobs') depname(`1') ///
              esample(`touse')
    ereturn local depvar "`1'"
    ereturn local cmd "matreg"
    
    display
    ereturn display, level(`level')
 
  matrix drop sscp XX Xy hat
end 
/*  end matreg2.ado  */

Now, check out matreg2.

matreg2 write read female

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------

The eclass option does for estimation commands what the rclass option does for regular statistical commands. To view the eclass values we need to use the ereturn list command>

ereturn list

scalars:
                  e(N) =  200
               e(df_m) =  2
               e(df_r) =  197
                  e(F) =  77.21062421518363
                 e(r2) =  .4394192130387506
               e(rmse) =  7.132734938503835
                e(mss) =  7856.321182518186
                e(rss) =  10022.5538174818
               e(r2_a) =  .4337280375366059
                 e(ll) =  -675.2152914029984
               e(ll_0) =  -733.0934827146212

macros:
            e(cmdline) : "regress write read female, nohead"
              e(title) : "Linear regression"
                e(vce) : "ols"
             e(depvar) : "write"
                e(cmd) : "regress"
         e(properties) : "b V"
            e(predict) : "regres_p"
              e(model) : "ols"
          e(estat_cmd) : "regress_estat"

matrices:
                  e(b) :  1 x 3
                  e(V) :  3 x 3

functions:
             e(sample)  

matrix list e(b)

e(b)[1,3]
         read     female      _cons
y1  .56588693   5.486894  20.228368

matrix list e(V)

symmetric e(V)[3,3]
              read      female       _cons
  read   .00243887
female   .00265893   1.0287262
 _cons  -.12883112  -.69953157   7.3644737

Conclusion

Spending a some time learning to program in Stata can increase your productivity and efficiency in your data analysis projects.