Introduction to Stata Programming

Many researchers use Stata without ever writing a program even though programming could make them more efficient in their data analysis projects. Stata programming is not difficult since it mainly involves the use of Stata commands that you already use. The trick to Stata programming is to use the appropriate commands in the right sequence. Of course, this is the trick to any kind of programming.

There are two kinds of files that are used in Stata programming, do-files and ado-files. Do-files are run from the command line using the do command, for example,

do hsbcheck hsberr

Ado-files, on the other hand, work like ordinary Stata commands by just using the file name in the command line, for example,

ttest write, by(female)

In fact, many of the built-in Stata commands are just ado-files, like the ttest command shown above. You can look at the source code for the ado commands using the viewsource command, for example,

viewsource ttest.ado

You can also use viewsource with your do-files.

viewsource hsbcheck.do

Do-files can be placed in the same folder as the data but ado-files need to go where Stata can find them. The best place for user written ado-files is in the /ado/personal/ directory. The location of this directory can vary for system to system.

We will try to give you a feel for Stata programming by covering the following topics:

Creating and using do-files for checking and cleaning data.
Using do-files for analyzing data.
Writing an ado program to create a statistical command.
Using Stata and Mata matrix commands

Dirty Data

We will create a do-file, hsbcheck.do, that contains commands that will display observations with incorrect or impossible values.

/* begin hsbcheck.do */
version 11.0
clear
use `1', clear
set more off

nmissing
describe
summarize
list id if id200
list id gender if ~ inlist(gender,1,2,.)  
list id race   if ~ inlist(race,1,2,3,4,.)  
list id ses    if ~ inlist(ses,1,2,3,.) 
list id schtyp if ~ inlist(schtyp,1,2,.)   
list id prog   if ~ inlist(prog,1,2,3,.)   
list id read    if (read99)      & read~=.
list id write   if (write99)     & write~=.
list id math    if (math99)      & math~=.
list id science if (science99)   & science~=.
list id socst   if (socst99)     & socst~=.
*list id female  if (female1)     & female~=.

set more on
/* end hsbcheck.do */

Here is how our do-file is used with the dataset hsberr.

do hsbcheck hsberr

[output omitted]

So how did the hsbcheck program “know” which file to use? This was done using a macro variable, in this case, `1′, which takes the first term typed after the name fof the program and treats it as as file name. Macro variables have many uses including as variable names or numeric values. We will see additional uses of macro variables in other programs.

Now that we know what errors there are in the data we can write a do-file that will fix the errors. When we know the correct value of an observation, we will replace the incorrect value with the correct one. When we do not know the correct value for an observation, we will replace the incorrect value with missing. The do-file hsbfix.do will read in hsberr, correct the errors and save the corrected file as hsbclean. Here is what hsbfix.do looks like.

/* begin hsbfix.do */
  use hsberr, clear

  replace id=193     if id==1193
  replace read=47    if read==147
  replace science=61 if science==-61
  replace gender=.   if gender==5
  replace race=.     if race4
  replace ses=.      if ses3
  replace schtyp=.   if schtyp2
  replace prog=.     if prog3
  tab gender
  tab reg
  /* create female from gender */
  
  generate female = gender
  
  recode female 1=0 2=1
  l
  label define fem 0 "male" 1 "female"
  
  label value female fem
  tab female gender
  label data "hsb clean data using hsberr.do"
  save hsbclean, replace 
/* end hsbfix.do */

One important thing to note is that after we fix the incorrect values, we will save the data file with a new name. We will never change any of the values in the original data file, hsberr.

First, we will run hsbfix on the original file hsberr then, as a check, we will run hsbcheck on the new file hsbclean.

do hsbfix

do hsbcheck hsbclean

[output omitted]

Analyze This!

Next, we will create a do-file that contains all of the commands that we need to run our data analysis. This do-file will be called hsbanalyze.do.

/* begin hsbanalyze.do */
log using hsb_fall_2011.txt, text replace

summarize read write socst 
univar read write math science 
tabstat write, stat(n mean sd p25 p50 p75) by(female)
tabstat write, stat(n mean sd p25 p50 p75) by(prog)
ttest write, by(female)
hist write, normal start(30) width(5) name(hist_write, replace)
kdensity write, normal name(kd_write, replace)
tab1 female ses prog
tab prog ses, all 
correlate write read socst female
graph matrix read socst write, half name(grmat, replace)
regress write read i.female##c.socst
margins female, at(socst=(25(5)70)) atmeans asbalanced
marginsplot, recast(line) recastci(rarea) name(plot, replace)

log close
/* end hsbanalyze */

Now, let’s use hsbanalyze with our data file hsbclean.

use hsbdemo, clear

do hsbanalyze

[output omitted]

This may not seem all that useful; after all, you could just as easily type each of the commands into the command window, but what if your coauthor comes to you and says, “we need to redo the whole analysis using only schtyp equal to one.” Here’s all you have to do.

keep if schtyp==1

do hsbanalyze

[output omitted]

Many Happy Returns

Return list are one of the most powerful and useful features of Stata. There are three commonly used return lists: 1) return list for ordinary nonestimation commands; 2) ereturn list for full estimation commands; and 3) creturn list for a list constants and system parameters.

Let’s start with the creturn list (abbr: cret lis).

cret lis

You can access any the values using c(name) by replacing name with the name of the function. For example to get today’s date and time,

display "Today is" c(current_date) " and the current time is " c(current_time)

This can also be useful appending a date onto a file name.

local today c(current_date)
local today = subinstr(`today'," ","_",.)
save hsb`today', replace

Here is an example of the return list (abbr: ret lis) following the summarize command.

summarize write, detail

                        writing score
-------------------------------------------------------------
      Percentiles      Smallest
 1%           31             31
 5%         35.5             31
10%           39             31       Obs                 200
25%         45.5             31       Sum of Wgt.         200

50%           54                      Mean             52.775
                        Largest       Std. Dev.      9.478586
75%           60             67
90%           65             67       Variance       89.84359
95%           65             67       Skewness      -.4784158
99%           67             67       Kurtosis       2.238527

ret lis

scalars:
                  r(N) =  200
              r(sum_w) =  200
               r(mean) =  52.775
                r(Var) =  89.84359296482411
                 r(sd) =  9.47858602138653
           r(skewness) =  -.4784157665394925
           r(kurtosis) =  2.238527050562138
                r(sum) =  10555
                r(min) =  31
                r(max) =  67
                 r(p1) =  31
                 r(p5) =  35.5
                r(p10) =  39
                r(p25) =  45.5
                r(p50) =  54
                r(p75) =  60
                r(p90) =  65
                r(p95) =  65
                r(p99) =  67

We can use this information to compute a statistics, such as, the coefficient of variation that Stata does not provide.

display as txt "Coefficient of variation = " as res r(sd)/abs(r(mean))*100

Coefficient of variation = 17.960371

The ereturn list (abbr: eret lis) is used following estimation commands, such as, regress, anova, logit, sem, etc. Here is an example following regress.

use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

regress write read female 

[output omitted]

eret lis

scalars:
                  e(N) =  200
               e(df_m) =  2
               e(df_r) =  197
                  e(F) =  77.21062421518373
                 e(r2) =  .439419213038751
               e(rmse) =  7.132734938503835
                e(mss) =  7856.321182518197
                e(rss) =  10022.5538174818
               e(r2_a) =  .4337280375366064
                 e(ll) =  -675.2152914029984
               e(ll_0) =  -733.0934827146214
               e(rank) =  3

macros:
            e(cmdline) : "regress write read female"
              e(title) : "Linear regression"
          e(marginsok) : "XB default"
                e(vce) : "ols"
             e(depvar) : "write"
                e(cmd) : "regress"
         e(properties) : "b V"
            e(predict) : "regres_p"
              e(model) : "ols"
          e(estat_cmd) : "regress_estat"

matrices:
                  e(b) :  1 x 3
                  e(V) :  3 x 3

functions:
             e(sample)

The matrix e(b) contains the parameter estimates while e(V) has the covariance matrix of the parameter estimates.

mat list e(b)

e(b)[1,3]
         read     female      _cons
y1  .56588693   5.486894  20.228368

mat list e(V)

symmetric e(V)[3,3]
              read      female       _cons
  read   .00243887
female   .00265893   1.0287262
 _cons  -.12883112  -.69953157   7.3644737

You can also access the coefficients and standard errors using _b[varname] and _se[varname]. For example, if you wanted the predicted score for a female with a reading score of 60, you could type the following.

display _b[_cons] + _b[female]*1 + _b[read]*60 

59.668478

Macrobiotics

Macro variables are a good way to store values for later use. Stata supports two kinds of macro variables: 1) global macros and 2) local macros. Global macros are saved until Stata is shut down or the macros are cleared while local macros exist only while the do-file or ado-file is being run. Then they disappear. Macros have a name and a way to refer to the values stored in the macros. For global macros $name refers to the value of the macro called name. For local macros `name’ (watch out for the two kinds quote marks) refers to the value of the macro called name.

Say I wanted to compute the difference in medians for two groups. Here is one way you could do this using local macros.

quietly sum write if female==0, detail
local mmedian = r(p50)
quietly sum write if female==1, detail
local fmedian = r(p50)
display "median difference = " `fmedian' - `mmedian'

median difference = 5

macro lis

[output omitted]

Here is the same thing using global macros.

quietly sum write if female==0, detail
global mmedian = r(p50)
quietly sum write if female==1, detail
global fmedian = r(p50)
display "median difference = " $fmedian - $mmedian

median difference = 5

macro list

fmedian:        57
mmedian:        52

[output omitted]

This only scratches the surface of the utility of macro variables. We will see other examples as we go along.

Feeling Loopy

Many programming languages support looping. Stata has several ways of doing loops: foreach, forvalues and while. We don’t have the time to demonstrate all of them, so we will show you two examples of looping across variables.

For the first example we want to create centered variables and squared center variables for five variables.

use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
foreach var of varlist read write math science socst {
  quietly sum `var'
  gen c`var' = `var' - r(mean)  // create centered variable
  gen c`var'2 = c`var'^2        // create squared centered variable
}

For our second example we want to create a long dataset in which our variables are stacked on top on one another.

use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

list id read write math science socst in 1/12, sep(0)

     +---------------------------------------------+
     |  id   read   write   math   science   socst |
     |---------------------------------------------|
  1. |  45     34      35     41        29      26 |
  2. | 108     34      33     41        36      36 |
  3. |  15     39      39     44        26      42 |
  4. |  67     37      37     42        33      32 |
  5. | 153     39      31     40        39      51 |
  6. |  51     42      36     42        31      39 |
  7. | 164     31      36     46        39      46 |
  8. | 133     50      31     40        34      31 |
  9. |   2     39      41     33        42      41 |
 10. |  53     34      37     46        39      31 |
 11. |   1     34      44     40        39      41 |
 12. | 128     39      33     38        47      41 |
     +---------------------------------------------+

local i = 1
foreach var of varlist read write math science socst {
  rename `var' y`i'
  local i = `i' + 1
}

list id y* in 1/12, sep(0)

     +------------------------------+
     |  id   y1   y2   y3   y4   y5 |
     |------------------------------|
  1. |  45   34   35   41   29   26 |
  2. | 108   34   33   41   36   36 |
  3. |  15   39   39   44   26   42 |
  4. |  67   37   37   42   33   32 |
  5. | 153   39   31   40   39   51 |
  6. |  51   42   36   42   31   39 |
  7. | 164   31   36   46   39   46 |
  8. | 133   50   31   40   34   31 |
  9. |   2   39   41   33   42   41 |
 10. |  53   34   37   46   39   31 |
 11. |   1   34   44   40   39   41 |
 12. | 128   39   33   38   47   41 |
     +------------------------------+

reshape long y, i(id) j(var)

lis id y var in 1/12, sep(0)

     +---------------+
     | id    y   var |
     |---------------|
  1. |  1   34     1 |
  2. |  1   44     2 |
  3. |  1   40     3 |
  4. |  1   39     4 |
  5. |  1   41     5 |
  6. |  2   39     1 |
  7. |  2   41     2 |
  8. |  2   33     3 |
  9. |  2   42     4 |
 10. |  2   41     5 |
 11. |  3   63     1 |
 12. |  3   65     2 |
     +---------------+

It is possible to loop throught observations (rows) but it is usually not necessary. Here is an example in which we create an index value that is equal to the observation number..

use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

local bign = _N
generate index = .
forvalues i=1/`bign' {
  quietly replace index=`i' in `i'
}

list index id in 1/10

     +-------------+
     | index    id |
     |-------------|
  1. |     1    45 |
  2. |     2   108 |
  3. |     3    15 |
  4. |     4    67 |
  5. |     5   153 |
     |-------------|
  6. |     6    51 |
  7. |     7   164 |
  8. |     8   133 |
  9. |     9     2 |
 10. |    10    53 |
     +-------------+

Here is a much simpler way to get the same result

drop index

generate index = _n

There is no end to the uses for looping. Unfortunately, we don’t have time to cover more today.

Somewhat Iffy

Every good programming language has the ability to conditionally execute commands. Stata is no exception. It has both if and else to allow you to controls the execution of commands. We will illustate the use of if with an example to runs a series of ttests and reports which ones are statistically significant.

foreach var of varlist read write math science {
  quietly ttest `var', by(female)
  if r(p)

variable read t = .74801096 not statistically significant
variable write t = -3.7340739* statistically significant
variable math t = .41299865 not statistically significant
variable science t = 1.8123753 not statistically significant

Let’s revisit the issue of looping through observation (rows) with an added if statement. Let’s say that we want a list of all the females with reading scores less than 50. Here one way.

use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

display "Females with a reading score less than 50"
local bign = _N
forvalues i=1/`bign' {
  if female[`i']==1 & read[`i']

45  34
51  42
2  39
1  34
106  36
89  35
19  28
...
[output omitted]

Here is a much better (simpler and faster) way to get the same results by subsetting with if as part of list the command.

list id read if female==1 & read<50

[output omitted]

Most of Stata’s commands work the if and in clauses.

Give’em the Boot

Bootstrapping is an alternative method for determining standard errors. For standard estimation commands, bootstrapping is just a matter of using the vce(boot) option.

use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

regress write read female, vce(boot, reps(100))

(running regress on estimation sample)

Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

Linear regression                               Number of obs      =       200
                                                Replications       =       100
                                                Wald chi2(2)       =    215.00
                                                Prob > chi2        =    0.0000
                                                R-squared          =    0.4394
                                                Adj R-squared      =    0.4337
                                                Root MSE           =    7.1327

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
       write |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .5658869   .0428593    13.20   0.000     .4818842    .6498897
      female |   5.486894   .9479372     5.79   0.000     3.628971    7.344817
       _cons |   20.22837   2.423242     8.35   0.000      15.4789    24.97784
------------------------------------------------------------------------------

But what if we want bootstrap standard errors for a statistic that Stata does not compute. We will need to write our own ado program. Let’s go back to the example of the difference in medians between males and females. We will write a program named med_dif which we will save is a file named med_diff.ado. Here’s how we do this.

/* begin med_dif program */
program med_diff, rclass
  quietly sum write if female==0, detail
  local mmedian = r(p50)
  quietly sum write if female==1, detail
  local fmedian = r(p50)
  return scalar meddif = fmedian - mmedian
  display as txt "median difference = " `fmedian' - `mmedian'
end
/* end med_dif program */

Let’s try it.

use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
med_diff

median difference = 5

ret lis

scalars:
             r(meddif) =  5

Now, let’s use med_diff with the bootstrap command.

bootstrap r(meddif), reps(100): med_diff
(running med_diff on estimation sample)

Warning:  Because med_diff is not an estimation command or does not set e(sample),
          bootstrap has no way to determine which observations are used in
          calculating the statistics and so assumes that all observations are used.
          This means that no observations will be excluded from the resampling
          because of missing values or other reasons.

          If the assumption is not true, press Break, save the data, and drop the
          observations that are to be excluded.  Be sure that the dataset in memory
          contains only the relevant data.

Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

Bootstrap results                               Number of obs      =       200
                                                Replications       =       100

      command:  med_diff
        _bs_1:  r(meddif)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |          5   2.844435     1.76   0.079    -.5749893    10.57499
------------------------------------------------------------------------------

estat boot, percentile

Bootstrap results                               Number of obs      =       200
                                                Replications       =       100

      command:  med_diff
        _bs_1:  r(meddif)

------------------------------------------------------------------------------
             |    Observed               Bootstrap
             |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |           5        .99   2.8444346           0         12   (P)
------------------------------------------------------------------------------
(P)    percentile confidence interval

Talkin’ ‘Bout My Egeneration

Egen commands can make your life as a programmer much easier by saving you from additional programming. Here is an example that creates a variable containg group means. First, the egen way.

use https://stats.idre.ucla.edu/stat/data/hsbmar, clear

egen fmean = mean(write), by(female)

And next, how you might program the same thing yourself.

drop fmean

quietly sum write if female==0, meanonly
generate fmean = r(mean) if female==0
quietly sum write if female==1, meanonly
replace fmean = r(mean) if female==1

Here is another example, in which we create a new variable which is the average of the non-missing values of four variables.

egen amean = rowmean(read write math science)

sum amean read write math science

And here is one way to program this. Note that we need to loop over the observations (rows).

drop amean

local bign = _N
generate amean=.

forvalues i=1/`bign' {
  local sum = 0
  local n = 0
  foreach var of varlist read write math science {
     if `var'[`i']~=. {
       local sum = `sum' + `var'[`i']
       local n = `n' + 1
     }
     if `n'~=0 {
     replace amean = `sum'/`n' in `i'
     }
  }
}

These are just two of the many egen commands available to programmers.

The Matrix Reloaded

There are two matric systems in Stata: 1) Traditional Stata matrix commands and 2) Mata which is a full programming language in addition to a matrix language. Many of the Stata estimation commands are written in Mata. Here are two examples of doing regression using matrices. We will use the basic regression formula shown below with each method.

b = (X'X)^-1X'Y

We want to regres write on female and read. Hopefully the results wil look something like this:

regress write female read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------

To begin we need to create a constant for the intercept, which we will call cons. Now, let’s try this with the Stata matrix commands.

generate cons = 1

mkmat female read cons, mat(X)
mkmat write, mat(y)

matrix b = syminv(X'*X)*X'*y

matrix list b

b[3,1]
            write
female   5.486894
  read  .56588693
  cons  20.228368

In Mata the code would look like this.

tomata female write read cons

mata

y = write
X = (female, read, cons)

b = invsym(X'X)*X'y

b

end

                 1
    +---------------+
  1 |  5.486893967  |
  2 |  .5658869298  |
  3 |  20.22836845  |
    +---------------+

These examples demonstrate how matrix commands may be used. Stata’s two matrix systems are powerful tools that can do both simple tasks and allow you to program complex estimation procedures.

Conclusion

Spending some time learning to program in Stata can increase your productivity and efficiency in working with your research data.

updated 10/11/11