The purpose of this seminar is to introduce the analysis of survey data using Stata 12. Before we begin, you will want to be sure that your copy of Stata is up to date. To do this, please type
update all
in the Stata command window and follow any instructions given. These updates not only fix known bugs but also add new features that may be useful. I am using Stata 12.1.
Before we begin looking at examples in Stata, we will review some basic issues and concepts in survey data analysis.
Why do we need special methods for survey data, and how do they differ from ordinary analyses?
Sampling designs
Most people do not conduct their own surveys. Rather, they use survey data that some agency or company collected and made available to the public. The documentation must be read carefully to find out what kind of sampling design was used to collect the data. This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs. Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.
Below are some common features of many sampling designs.
Sampling weights: There are several types of weights that can be associated with a survey. Perhaps the most common is the sampling weight, sometimes called a probability weight, which is used to weight the sample back to the population from which the sample was drawn. By definition, this weight is the inverse of the probability of being included in the sample due to the sampling design (except for a certainty PSU, see below). The probability weight, called a pweight in Stata, is calculated as N/n, where N = the number of elements in the population and n = the number of elements in the sample. For example, if a population has 10 elements and 3 are sampled at random with replacement, then the probability weight would be 10/3 = 3.33. In a two-stage design, the probability weight is calculated as (1/f1)(1/f2): the inverse of the sampling fraction for the first stage multiplied by the inverse of the sampling fraction for the second stage. Under many sampling plans, the sum of the probability weights will equal the population total.
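To make the two-stage arithmetic concrete, here is a minimal sketch using made-up numbers (10 of 100 districts sampled at the first stage, 20 of 400 schools sampled within each selected district at the second stage); the quantities are hypothetical and only illustrate the formula:

* hypothetical two-stage design: f1 = 10/100 and f2 = 20/400
display "first-stage weight  = " 100/10
display "second-stage weight = " 400/20
display "probability weight  = " (100/10) * (400/20)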
While many textbooks will end their discussion of probability weights here, this definition does not fully describe the probability weights that are included with actual survey data sets. Rather, the "final weight" usually starts with the inverse of the sampling fraction, but then incorporates several other values, such as corrections for unit non-response, errors in the sampling frame (sometimes called non-coverage), and poststratification. Because these other values are included in the probability weight that is included with the data set, it is often inadvisable to modify the probability weights, such as trying to standardize them for a particular variable, e.g., age.
PSU: This is the primary sampling unit. This is the first unit that is sampled in the design. For example, school districts from California may be sampled and then schools within districts may be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU. One does not need to use the same sampling method at all levels of sampling. For example, probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts). In the case of a simple random sample, the PSUs and the elementary units are the same. In general, accounting for the clustering in the data (i.e., using the PSUs) will increase the standard errors of the estimates. Conversely, ignoring the PSUs will tend to yield standard errors that are too small, leading to false positives when doing significance tests.
Strata: Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race or SES. Each element in the population must belong to one, and only one, stratum. Once the strata have been defined, one samples from each stratum as if it were independent of all of the other strata. For example, if a sample is to be stratified on gender, men and women would be sampled independent of one another. This means that the probability weights for men will likely be different from the probability weights for the women. In most cases, you need to have two or more PSUs in each stratum. The purpose of stratification is to reduce the standard error of the estimates; stratification works most effectively when the variance of the dependent variable is smaller within the strata than in the sample as a whole.
FPC: This is the finite population correction. This is used when the sampling fraction (the number of elements or respondents sampled relative to the population) becomes large. The FPC is used in the calculation of the standard error of the estimate. If the value of the FPC is close to 1, it will have little impact and can be safely ignored. In some survey data analysis programs, such as SUDAAN, this information will be needed if you specify that the data were collected without replacement (see below for a definition of "without replacement"). The formula for calculating the FPC is ((N-n)/(N-1))^(1/2), where N is the number of elements in the population and n is the number of elements in the sample. To see the impact of the FPC for samples of various proportions, suppose that you had a population of 10,000 elements.
Sample size (n)      FPC
              1   1.0000
             10    .9995
            100    .9950
            500    .9747
           1000    .9487
           5000    .7071
           9000    .3162
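Any entry of this table can be verified from the formula directly in Stata; for example, the n = 500 row (output not shown, but the result rounds to .9747):

* FPC for n = 500 from a population of N = 10,000
display sqrt((10000 - 500) / (10000 - 1))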
Replicate weights: Replicate weights are a series of weight variables that are used to correct the standard errors for the sampling plan. They serve the same function as the PSU and strata (which use a Taylor series linearization) to correct the standard errors of the estimates for the sampling design. Many public use data sets are now being released with replicate weights instead of PSUs and strata in an effort to more securely protect the identity of the respondents. In theory, the same standard errors will be obtained using either the PSU and strata or the replicate weights. There are different ways of creating replicate weights; the method used is determined by the sampling plan. The most common are balanced repeated replication (BRR) and jackknife replicate weights. You will need to read the documentation for the survey data set carefully to learn what type of replicate weight is included in the data set; specifying the wrong type of replicate weight will likely lead to incorrect standard errors. For more information on replicate weights, please see Stata Library: Replicate Weights and Appendix D of the WesVar Manual by Westat, Inc. Several statistical packages, including Stata, SUDAAN, WesVar and R, allow the use of replicate weights.
Consequences of not using the design elements
Sampling design elements include the sampling weights, post-stratification weights (if provided), PSUs, strata, and replicate weights. Rarely are all of these elements included in a particular public-use data set. However, ignoring the design elements that are included can often lead to inaccurate point estimates and/or inaccurate standard errors.
Sampling with and without replacement
Most samples collected in the real world are collected "without replacement". This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample. Many of the calculations change depending on whether a sample is collected with or without replacement. Hence, programs like SUDAAN ask you to specify whether the survey sampling design was implemented with or without replacement, and an FPC is used when sampling was done without replacement, even if the value of the FPC is very close to one.
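In Stata, information about sampling without replacement is supplied through the fpc() option of the svyset command. A minimal sketch with made-up variable names (psuvar, stratvar, pwt, and fpcvar are hypothetical; your data set's documentation will give the actual names):

* hypothetical: supply an FPC variable when sampling was without replacement
svyset psuvar [pweight = pwt], strata(stratvar) fpc(fpcvar)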
Examples
For the examples in this seminar, we will use the Adult data set from NHANES III. The data set, setup file and documentation can be downloaded from the NHANES web site. An executable file, ADULT.exe (16 MB), is available that contains the data (65 MB), the SAS setup code (128 KB), and the documentation (1 MB). The NHANES III data sets were released with both pseudo-PSUs/pseudo-strata and replicate weights. We will show examples using both of these methods of variance estimation. For more information on the setup for NHANES III using other packages, and setups using other commonly used public-use survey data sets, please see our page on sample setups for commonly used survey data sets.
Reading the documentation
The first step in analyzing any survey data set is to read the documentation. With many of the public use data sets, the documentation can be quite extensive and sometimes even intimidating. Instead of trying to read the documentation "cover to cover", there are some parts you will want to focus on. First, read the Introduction. This is usually an "easy read" and will orient you to the survey. There is usually a section or chapter called "Sample Design and Analysis Guidelines", "Variance Estimation", etc. This is the part that tells you about the design elements included with the survey and how to use them. Some even give example code (although usually for SUDAAN). If multiple sampling weights have been included in the data set, there will be some instruction about when to use which one. If there is a section or chapter on missing data or imputation, please read that; it will tell you how missing data were handled. You should also read any documentation regarding the specific variables that you intend to use. As we will see a little later on, we will need to look at the documentation to get the value labels for the variables. This is especially important because some of the values are actually missing data codes, and you need to do something so that Stata doesn’t treat those as valid values (or you will get some very "interesting" means, totals, etc.); one way to do this is sketched below. Despite the length of the SAS code that comes with the data set, the value labels are not included.
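One common way to handle such codes in Stata is the mvdecode command, which maps listed values to missing. A minimal sketch, assuming that 88 and 99 are the missing data codes for the marital status variable hfa12 (we will see these codes later in the seminar; always confirm the codes for each variable in the documentation):

* map documented missing data codes to Stata missing values
mvdecode hfa12, mv(88 99)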
Getting the data into Stata
The first, and most obvious, thing that you need to do after downloading the data is to get the data into Stata. The data file itself is actually an ASCII file, and ASCII files can easily be read into Stata. However, the SAS code contains a lot (but not all) of the information that is needed to make the numbers in the ASCII file meaningful. For example, the SAS code contains the column numbers for each variable, the variable name, and the variable label. It does not, however, contain the value labels; you need to get those from the documentation. If you really had to, you could open the SAS code in any text editor, "copy and paste" the information into a Stata do-file, and modify it as needed to make a program that could read the ASCII file. Even if you are not a SAS user, though, this is probably much more work than making the few necessary modifications to the SAS code provided by NHANES. Let’s look at some of the SAS code to see what we need to modify to get it to run. The first few lines of the SAS code look like this:
FILENAME ADULT "D:QuestionnaireDATADULT.DAT" LRECL=3348;
*** LRECL includes 2 positions for CRLF, assuming use of PC SAS;
DATA WORK;
INFILE ADULT MISSOVER;
LENGTH
SEQN 7
DMPFSEQ 5
and the last few lines look like this:
HAZMNK5R = "Average K5 BP from household and MEC"
HAZNOK5R = "Number of BP’s used for average K5";
There are two things that you will need to change. The first is the path specification in the first line: change the path to point to the location of the downloaded data file, but leave the quotes and the file name (ADULT.DAT). The second thing that needs to be changed is at the very end of the file. Replace the box (an unprintable character) with
run;
Once you have made these changes, you can click on the running person at the top of the SAS screen (assuming that nothing is highlighted), and the whole program will run. You should glance through the log file looking for anything in red print that indicates an error (such as the .dat file isn’t in the location that you specified, which is a common mistake). To make sure that everything worked as planned, you can run the following command:
proc contents data = work; run;
This should give you some output telling you about the data set. Once you know that the data set is in SAS correctly, we can now move it over into Stata. First, we need to save the SAS data set to our hard drive. On the set statement, specify the path where you want the SAS data set saved.
data "D:dataworkingnhanes_adult1"; set work; run;
Now that the SAS data set is saved to our hard drive, we can use a program like Stat/Transfer to transfer the data into Stata format.
The svyset command
Before you do anything else, please make sure that your Stata is up-to-date. You can type
update all
and follow the instructions given. Stata has made many nice upgrades to the svy: commands since Stata 11 was released, so updating is a really good idea.
When you first try to open the NHANES data set in Stata 11 or earlier, you will likely get an error message about being out of memory. Unlike SAS and SPSS, Stata holds a data set in memory, so if an insufficient amount of RAM is allocated to Stata, Stata won’t be able to read in the data set. (As of Stata 12, memory is managed automatically and this step is unnecessary.) In earlier versions, you get around this by increasing the memory:
set memory 50m
Now you should be able to open the data set. (If 50m doesn’t do it on your computer, try 70m, 90m, etc.) Now that the data are in Stata, we need to do one more thing before starting our analyses: we need to issue the svyset command. The svyset command tells Stata about the design elements in the survey. Once this command has been issued, all you need to do for your analyses is use the svy: prefix before each command. Because the NHANES III data were released with both PSUs/strata and replicate weights, we have a choice of how to specify the svyset command. We will illustrate both ways, starting with the PSU/strata variables. If you read the NHANES documentation on variance estimation, you will know that these aren’t really the PSUs and strata used in the data collection. Rather, they are pseudo-PSUs and pseudo-strata: noise was added to the original PSU and strata variables to help protect the identity of the survey respondents, which is why they are called "pseudo". Stata doesn’t care that these variables are pseudo, and they are used in the svyset command as regular PSU and strata variables. The svyset command looks like this:
use "D:dataworkingnhanes_adult1", clear svyset sdppsu6 [pweight = wtpfqx6], strata(sdpstra6) singleunit(centered) pweight: wtpfqx6 VCE: linearized Single unit: centered Strata 1: sdpstra6 SU 1: sdppsu6 FPC 1: <zero>
The singleunit option was added in Stata 10. This option allows for different ways of handling a single PSU in a stratum. If you use the default option, missing, then you will get no standard errors when Stata encounters a single PSU in a stratum. This can happen as a result of missing data or subsetting the data. There are three other options. One is certainty, meaning that the singleton PSUs be treated as certainty PSUs; certainty PSUs are PSUs that were selected into the sample with a probability of 1 (in other words, these PSUs were certain to be in the sample) and do not contribute to the standard error. The scaled option gives a scaled version of the certainty option. The scaling factor comes from using the average of the variances from the strata with multiple sampling units for each stratum with one PSU. The centered option centers strata with one sampling unit at the grand mean instead of the stratum mean.
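For reference, here is how the same design would be declared under each of the four singleunit choices; only one of these would actually be issued (output not shown):

svyset sdppsu6 [pweight = wtpfqx6], strata(sdpstra6) singleunit(missing)
svyset sdppsu6 [pweight = wtpfqx6], strata(sdpstra6) singleunit(certainty)
svyset sdppsu6 [pweight = wtpfqx6], strata(sdpstra6) singleunit(scaled)
svyset sdppsu6 [pweight = wtpfqx6], strata(sdpstra6) singleunit(centered)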
svydescribe

Survey: Describing stage 1 sampling units

      pweight: wtpfqx6
          VCE: linearized
  Single unit: centered
     Strata 1: sdpstra6
         SU 1: sdppsu6
        FPC 1: <zero>

                                    #Obs per Unit
                                ----------------------------
 Stratum    #Units      #Obs        min      mean       max
--------  --------  --------   --------  --------  --------
       1         2       418        194     209.0       224
       2         2       478        222     239.0       256
       3         2       476        216     238.0       260
       4         2       381        180     190.5       201
       5         2       388        182     194.0       206
       6         2       404        194     202.0       210
       7         2       444        210     222.0       234
       8         2       430        210     215.0       220
       9         2       419        198     209.5       221
      10         2       471        229     235.5       242
      11         2       439        201     219.5       238
      12         2       379        179     189.5       200
      13         2       414        203     207.0       211
      14         2       482        228     241.0       254
      15         2       461        227     230.5       234
      16         2       555        273     277.5       282
      17         2       509        249     254.5       260
      18         2       474        236     237.0       238
      19         2       517        231     258.5       286
      20         2       480        233     240.0       247
      21         2       410        193     205.0       217
      22         2       499        227     249.5       272
      23         2       433        186     216.5       247
      24         2       467        224     233.5       243
      25         2       423        203     211.5       220
      26         2       438        179     219.0       259
      27         2       478        203     239.0       275
      28         2       472        233     236.0       239
      29         2       505        252     252.5       253
      30         2       475        224     237.5       251
      31         2       525        246     262.5       279
      32         2       427        207     213.5       220
      33         2       539        262     269.5       277
      34         2       492        222     246.0       270
      35         2       264        130     132.0       134
      36         2       219        108     109.5       111
      37         2       167         77      83.5        90
      38         2       185         86      92.5        99
      39         2       303        135     151.5       168
      40         2       218         98     109.0       120
      41         2       236         86     118.0       150
      42         2       197         91      98.5       106
      43         2       423        190     211.5       233
      44         2       497        220     248.5       277
      45         2       345        166     172.5       179
      46         2       217        107     108.5       110
      47         2       226         97     113.0       129
      48         2       594        274     297.0       320
      49         2       357        172     178.5       185
--------  --------  --------   --------  --------  --------
      49        98     20050         77     204.6       320

svy: tab hssex
(running tabulate on estimation sample)

Number of strata   =        49          Number of obs      =     20050
Number of PSUs     =        98          Population size    = 187647206
                                        Design df          =        49

-----------------------
sex       | proportions
----------+------------
        1 |       .4777
        2 |       .5223
          |
    Total |           1
-----------------------
  Key:  proportions  =  cell proportions

svy: tab hssex, count missing
(running tabulate on estimation sample)

Number of strata   =        49          Number of obs      =     20050
Number of PSUs     =        98          Population size    = 187647206
                                        Design df          =        49

----------------------
sex       |      count
----------+-----------
        1 |    9.0e+07
        2 |    9.8e+07
          |
    Total |    1.9e+08
----------------------
  Key:  count  =  weighted counts

svy: tab hssex, count cellwidth(10) format(%15.2g)
(running tabulate on estimation sample)

Number of strata   =        49          Number of obs      =     20050
Number of PSUs     =        98          Population size    = 187647206
                                        Design df          =        49

----------------------
sex       |      count
----------+-----------
        1 |   89637541
        2 |   98009665
          |
    Total |    1.9e+08
----------------------
  Key:  count  =  weighted counts

label define sex 1 male 2 female
label values hssex sex

label define race 1 white 2 black 3 other 8 MAUR
* MAUR = "Mexican-American of Unknown Race"
label values dmaracer race

label define mar 1 "married house" 2 "married not in house" ///
   3 "living as married" 4 widowed 5 divorced 6 separated ///
   7 "never married" 8 blank 9 DK
label values hfa12 mar

label define food 1 enough 2 sometimes 3 "often not enough" 8 blank
label values hff4 food

label define yn 1 yes 2 no 8 blank 9 DK
foreach var of varlist hfa13 hfe7 hfe8a hfe8b hfe8c hfe8d hfe8e hff6a ///
   hff6b hff6c hff6d hff7 hff8 hff9 hff10 hff11 hah13-hah17 hav5 {
   label values `var' yn
}

label define hah 1 "no difficulty" 2 "some difficulty" 3 "much difficulty" ///
   4 "unable to do" 8 blank 9 DK
foreach var of varlist hah1-hah12 {
   label values `var' hah
}

svy: tab hah1, count cellwidth(15) format(%15.2f) missing
(running tabulate on estimation sample)

Number of strata   =        49          Number of obs      =     20050
Number of PSUs     =        98          Population size    = 187647206
                                        Design df          =        49

---------------------------
difficult |
y walking |
a quarter |
of a mile |           count
----------+----------------
 no diffi |     97132055.68
 some dif |      8379187.23
 much dif |      3032776.77
 unable t |      4769359.13
    blank |       130435.68
       DK |      1236016.62
        . |     72967375.21
          |
    Total |    187647206.32
---------------------------
  Key:  count  =  weighted counts

svy: tab dmaracer, count cellwidth(15) format(%15.2f) missing
(running tabulate on estimation sample)

Number of strata   =        49          Number of obs      =     20050
Number of PSUs     =        98          Population size    = 187647206
                                        Design df          =        49

---------------------------
     race |           count
----------+----------------
    white |    158131126.11
    black |     21728087.51
    other |      7773929.35
     MAUR |        14063.35
          |
    Total |    187647206.32
---------------------------
  Key:  count  =  weighted counts

svy: tab hfa12, count cellwidth(15) format(%15.2f) missing
(running tabulate on estimation sample)

Number of strata   =        49          Number of obs      =     20050
Number of PSUs     =        98          Population size    = 187647206
                                        Design df          =        49

---------------------------
  marital |
   status |           count
----------+----------------
  married |    106397053.60
  married |      2367517.23
 living a |      7442555.25
  widowed |     13310800.60
 divorced |     14540696.43
 separate |      4570546.85
 never ma |     38181973.78
       88 |       826108.26
       99 |         9954.32
          |
    Total |    187647206.32
---------------------------
  Key:  count  =  weighted counts

* "Highest grade or yr of school completed"
svy: mean hfa8r
(running mean on estimation sample)

Survey: Mean estimation

Number of strata   =        49          Number of obs      =     20050
Number of PSUs     =        98          Population size    = 187647206
                                        Design df          =        49

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |   12.90589    .1268616     12.65095    13.16082
--------------------------------------------------------------
As of Stata 10, you can request the standard deviation of a variable after running the svy: mean command.
estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
       hfa8r |   12.90589    7.880968
-------------------------------------

Using the replicate weights
Let’s reissue the svyset command, this time using the replicate weights. First, we will clear the current settings.
svyset, clear
Next, we need to do a little math to convert the value of Fay’s adjustment (here, 1.7) into the Fay’s value that Stata expects. The formula is Fay = 1 - 1/sqrt(adjfay). We will use the vce(brr) and mse options to obtain the standard errors given by SUDAAN.
display 1-(1/sqrt(1.7))
.23303501

svyset [pweight = wtpfqx6], brrweight(wtpqrp1 - wtpqrp52) fay(.23303501) vce(brr) mse

      pweight: wtpfqx6
          VCE: brr
          MSE: on
    brrweight: wtpqrp1 wtpqrp2 wtpqrp3 wtpqrp4 wtpqrp5 wtpqrp6 wtpqrp7 wtpqrp8
               wtpqrp9 wtpqrp10 wtpqrp11 wtpqrp12 wtpqrp13 wtpqrp14 wtpqrp15
               wtpqrp16 wtpqrp17 wtpqrp18 wtpqrp19 wtpqrp20 wtpqrp21 wtpqrp22
               wtpqrp23 wtpqrp24 wtpqrp25 wtpqrp26 wtpqrp27 wtpqrp28 wtpqrp29
               wtpqrp30 wtpqrp31 wtpqrp32 wtpqrp33 wtpqrp34 wtpqrp35 wtpqrp36
               wtpqrp37 wtpqrp38 wtpqrp39 wtpqrp40 wtpqrp41 wtpqrp42 wtpqrp43
               wtpqrp44 wtpqrp45 wtpqrp46 wtpqrp47 wtpqrp48 wtpqrp49 wtpqrp50
               wtpqrp51 wtpqrp52
          fay: .23303501
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

svy: mean hff5
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs      =      1333
                                    Population size    =   7092623
                                    Replications       =        52
                                    Design df          =        51

--------------------------------------------------------------
             |        BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
        hff5 |   15.65079    1.789433     12.05836    19.24323
--------------------------------------------------------------
As you can see, the estimated mean itself is the same whether we use the PSU/strata or the replicate weights, as it must be: these design elements adjust only the standard errors, not the point estimates.
Let’s briefly consider an issue that arises when data are missing. We will run the nmissing command to confirm that only hff5 has missing data, and then we will run two svy: mean commands, the first with only the variable hfa8r, and the second with both variables. As you can see, the mean for hfa8r is different when run with hff5, because Stata is doing a listwise deletion of incomplete cases before calculating the means. In other words, even though there are no missing data from hfa8r, only 1333 of the 20050 data points are used in the calculation of the mean of hfa8r when you use both variables together.
* search nmissing
nmissing hfa8r hff5

hff5       18717

svy: mean hfa8r
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs      =     20050
                                    Population size    = 187647206
                                    Replications       =        52
                                    Design df          =        51

--------------------------------------------------------------
             |        BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |   12.90589    .1109928     12.68306    13.12871
--------------------------------------------------------------

svy: mean hfa8r hff5
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs      =      1333
                                    Population size    =   7092623
                                    Replications       =        52
                                    Design df          =        51

--------------------------------------------------------------
             |        BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |    17.9756    1.739453     14.48351     21.4677
        hff5 |   15.65079    1.789433     12.05836    19.24323
--------------------------------------------------------------

Comparing variance estimation techniques
Now that we have seen the use of both the PSU/strata and the replicate weights to adjust the standard errors, let’s compare the two side-by-side. We will start by reissuing the svyset command using the PSU/strata.
svyset, clear

svyset sdppsu6 [pweight = wtpfqx6], strata(sdpstra6) singleunit(centered)

      pweight: wtpfqx6
          VCE: linearized
  Single unit: centered
     Strata 1: sdpstra6
         SU 1: sdppsu6
        FPC 1: <zero>

svy: mean hfa8r
(running mean on estimation sample)

Survey: Mean estimation

Number of strata   =        49          Number of obs      =     20050
Number of PSUs     =        98          Population size    = 187647206
                                        Design df          =        49

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |   12.90589    .1268616     12.65095    13.16082
--------------------------------------------------------------
We can also issue the estat effects command. The Deff and the Deft are types of design effects, which tell you about the efficiency of your sample. The Deff is a ratio of two variances. In the numerator we have the variance estimate from the current sample (including all of its design elements), and in the denominator we have the variance from a hypothetical sample of the same size drawn as an SRS (simple random sample). In other words, the Deff tells you how efficient your sample is compared to an SRS of equal size. If the Deff is less than 1, your sample is more efficient than SRS; usually the Deff is greater than 1. The Deft is the ratio of two standard error estimates. Again, the numerator is the standard error estimate from the current sample. The denominator is a hypothetical SRS (with replacement) standard error from a sample of the same size as the current sample. You can also use the meff and meft options to get the misspecification effects; an example command is given after the output below. Misspecification effects are a ratio of the variance estimate from the current analysis to a hypothetical variance estimated from a misspecified model. Please see the Stata documentation for more details on how these are calculated.
estat effects

----------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.       Deff      Deft
-------------+--------------------------------------------
       hfa8r |   12.90589    .1268616    5.19536   2.27933
----------------------------------------------------------
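The misspecification effects mentioned above can be requested alongside the design effects by adding the meff and meft options (output not shown):

estat effects, deff deft meff meft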
Now let’s use the replicate weights and run the same analyses.
svyset, clear

svyset [pweight = wtpfqx6], brrweight(wtpqrp1 - wtpqrp52) fay(.23303501) vce(brr) mse

      pweight: wtpfqx6
          VCE: brr
          MSE: on
    brrweight: wtpqrp1 wtpqrp2 wtpqrp3 wtpqrp4 wtpqrp5 wtpqrp6 wtpqrp7 wtpqrp8
               wtpqrp9 wtpqrp10 wtpqrp11 wtpqrp12 wtpqrp13 wtpqrp14 wtpqrp15
               wtpqrp16 wtpqrp17 wtpqrp18 wtpqrp19 wtpqrp20 wtpqrp21 wtpqrp22
               wtpqrp23 wtpqrp24 wtpqrp25 wtpqrp26 wtpqrp27 wtpqrp28 wtpqrp29
               wtpqrp30 wtpqrp31 wtpqrp32 wtpqrp33 wtpqrp34 wtpqrp35 wtpqrp36
               wtpqrp37 wtpqrp38 wtpqrp39 wtpqrp40 wtpqrp41 wtpqrp42 wtpqrp43
               wtpqrp44 wtpqrp45 wtpqrp46 wtpqrp47 wtpqrp48 wtpqrp49 wtpqrp50
               wtpqrp51 wtpqrp52
          fay: .23303501
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

svy: mean hfa8r
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs      =     20050
                                    Population size    = 187647206
                                    Replications       =        52
                                    Design df          =        51

--------------------------------------------------------------
             |        BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |   12.90589    .1109928     12.68306    13.12871
--------------------------------------------------------------

estat effects

----------------------------------------------------------
             |        BRR *
             |       Mean   Std. Err.       Deff      Deft
-------------+--------------------------------------------
       hfa8r |   12.90589    .1109928     3.9769   1.99422
----------------------------------------------------------
As you can see, every part of the output is exactly the same except the design df, the standard errors, and the design effects derived from them. Notice that the standard errors are always larger for the analyses with the PSU/strata set than with the replicate weights. That is not because the replicate weight method of variance correction is more efficient than the linearization method. Rather, these are pseudo-PSUs and pseudo-strata, and the noise added to the PSUs and strata to make them pseudo inflates the standard errors. If we had access to the real PSUs and strata and used those in the analyses, I suspect that the standard errors would be extremely close to those obtained with the replicate weights.
Analyses of subpopulations
The analysis of subpopulations is one place where survey data and experimental data are quite different. If you have data from an experiment (or quasi-experiment), and you want to analyze the responses from, say, just the women, or just people over age 50, you can just delete the unwanted cases from the data set or use the by: prefix. Survey data are different. With survey data, you (almost) never get to delete any cases from the data set, even if you will never use them in any of your analyses. Because of the way the by: prefix works, you usually don’t use it with survey data either. Instead, Stata has provided two options that allow you to correctly analyze subpopulations of your survey data. These options are subpop and over. The subpop option is sort of like deleting unwanted cases (without really deleting them, of course), and the over option is very similar to by: processing. We will start with some examples of the subpop option.
But first, let’s take a second to see why deleting cases from a survey data set can be so problematic. If the data set is subset (meaning that observations not to be included in the subpopulation are deleted from the data set), the standard errors of the estimates cannot be calculated correctly. When the subpopulation option(s) is used, only the cases defined by the subpopulation are used in the calculation of the estimate, but all cases are used in the calculation of the standard errors. For more information on this issue, please see Sampling Techniques, Third Edition by William G. Cochran (1977) and Small Area Estimation by J. N. K. Rao (2003). Also, if you look in the Stata 9 Survey manual, you will find an entire section (pages 38-43) dedicated to the analysis of subpopulations. The formulas for using both if and subpop are given, along with an explanation of how they are different. If you look at the help for any svy: command, you will see the same warning:
Warning: Use of if or in restrictions will not produce correct variance estimates for subpopulations in many cases. To compute estimates for subpopulations, use the subpop() option. The full specification for subpop() is subpop([varname] [if])
Now that we know what not to do, let’s see how to do this right. We will start with a simple mean and then use hssex as our subpopulation variable.
svy: mean hfa8r
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs      =     20050
                                    Population size    = 187647206
                                    Replications       =        52
                                    Design df          =        51

--------------------------------------------------------------
             |        BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |   12.90589    .1109928     12.68306    13.12871
--------------------------------------------------------------

svy, subpop(hssex): mean hfa8r

Note: subpop() takes on values other than 0 and 1
      subpop() != 0 indicates subpopulation

(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs      =     20050
                                    Population size    = 187647206
                                    Subpop. no. obs    =     20050
                                    Subpop. size       = 187647206
                                    Replications       =        52
                                    Design df          =        51

--------------------------------------------------------------
             |        BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |   12.90589    .1109928     12.68306    13.12871
--------------------------------------------------------------
Clearly, something went wrong here. The note at the top of the output tells us what happened: Stata wants the subpopulation variable to be coded 0/1. Let’s look at the coding of hssex.
codebook hssex

-------------------------------------------------------------------------------
hssex                                                                        sex
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  sex

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/20050

            tabulation:  Freq.   Numeric  Label
                          9401         1  male
                         10649         2  female
Let’s recode hssex into a new variable, which we will call hssex1, and rerun the analysis.
generate hssex1 = hssex

recode hssex1 (2 = 0)
(hssex1: 10649 changes made)

svy, subpop(hssex1): mean hfa8r
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs      =      20050
                                    Population size    =  187647206
                                    Subpop. no. obs    =       9401
                                    Subpop. size       = 89637541.1
                                    Replications       =         52
                                    Design df          =         51

--------------------------------------------------------------
             |        BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |    13.0614    .1534762     12.75328    13.36951
--------------------------------------------------------------
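Because the full specification of subpop() is subpop([varname] [if]), as the warning quoted earlier shows, the same estimate can also be obtained without creating a recoded variable by passing an if qualifier directly (output not shown):

svy, subpop(if hssex == 1): mean hfa8r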
We can also use the over option to get estimates for all categories of the variable. In this case, we get the mean of the highest year of school completed for men and women (1 = male and 2 = female). The over option allows for variables coded 1/2 and for multicategory variables.
svy: mean hfa8r, over(hssex)
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs      =     20050
                                    Population size    = 187647206
                                    Replications       =        52
                                    Design df          =        51

        1: hssex = 1
        2: hssex = 2

--------------------------------------------------------------
             |        BRR *
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
hfa8r        |
           1 |    13.0614    .1534762     12.75328    13.36951
           2 |   12.76366    .1046096     12.55365    12.97367
--------------------------------------------------------------

svy: mean hfa8r, over(hfa12)
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs      =     20050
                                    Population size    = 187647206
                                    Replications       =        52
                                    Design df          =        51

        1: hfa12 = 1
        2: hfa12 = 2
        3: hfa12 = 3
        4: hfa12 = 4
        5: hfa12 = 5
        6: hfa12 = 6
        7: hfa12 = 7
       88: hfa12 = 88
       99: hfa12 = 99

--------------------------------------------------------------
             |        BRR *
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
hfa8r        |
           1 |   12.81528    .1065852      12.6013    13.02926
           2 |   11.49089    .3356867     10.81697    12.16481
           3 |   12.41727     .244054     11.92731    12.90723
           4 |   10.84711    .1758893       10.494    11.20022
           5 |   12.79547    .1453802     12.50361    13.08733
           6 |   11.27132    .2968793     10.67531    11.86733
           7 |   12.79864    .1511575     12.49518     13.1021
          88 |   81.54784    2.655634     76.21643    86.87925
          99 |   62.79067    24.52562     13.55343    112.0279
--------------------------------------------------------------
The over option is available only for svy: mean, svy: proportion, svy: ratio and svy: total. Unfortunately, there are no nice display options for svy: total like there are for svy: tab to show the actual values of the totals. We can use the matrix list command to list the values stored in the matrix e(b), although sometimes those are in scientific notation as well. We can use the estat size command to get the unweighted and weighted size (i.e., count) of each subpopulation. This is a good thing to do, because you need to know how many observations are in each subpopulation. If "99" were a subpopulation of interest to us, we would be in trouble, because there are only three observations in that subpopulation.
svy: total hssex, over(hfa12)
(running total on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Total estimation            Number of obs      =     20050
                                    Population size    = 187647206
                                    Replications       =        52
                                    Design df          =        51

        1: hfa12 = 1
        2: hfa12 = 2
        3: hfa12 = 3
        4: hfa12 = 4
        5: hfa12 = 5
        6: hfa12 = 6
        7: hfa12 = 7
       88: hfa12 = 88
       99: hfa12 = 99

--------------------------------------------------------------
             |        BRR *
        Over |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
hssex        |
           1 |   1.59e+08     2200084     1.54e+08    1.63e+08
           2 |    3573113    284678.1      3001597     4144628
           3 |   1.11e+07    651988.5      9826916    1.24e+07
           4 |   2.45e+07    506533.4     2.35e+07    2.55e+07
           5 |   2.36e+07    962190.5     2.17e+07    2.56e+07
           6 |    7756844      456243      6840898     8672790
           7 |   5.53e+07     1789277     5.17e+07    5.89e+07
          88 |    1203385    243289.5     714960.5     1691809
          99 |   15765.66    10722.36    -5760.372    37291.69
--------------------------------------------------------------

mat list e(b)

e(b)[1,9]
         hssex:      hssex:      hssex:      hssex:      hssex:      hssex:
     _subpop_1   _subpop_2   _subpop_3     widowed    divorced   separated
y1   1.585e+08   3573112.5    11135838    24474595    23634198   7756844.2

         hssex:      hssex:      hssex:
     _subpop_7          88          99
y1    55314851   1203384.5    15765.66

estat size

        1: hfa12 = 1
        2: hfa12 = 2
        3: hfa12 = 3
        4: hfa12 = 4
        5: hfa12 = 5
        6: hfa12 = 6
        7: hfa12 = 7
       88: hfa12 = 88
       99: hfa12 = 99

----------------------------------------------------------------------
             |        BRR *
        Over |      Total   Std. Err.        Obs           Size
-------------+--------------------------------------------------------
hssex        |
           1 |   1.59e+08     2200084      10241    106397053.6
           2 |    3573113    284678.1        364     2367517.23
           3 |   1.11e+07    651988.5        781     7442555.25
           4 |   2.45e+07    506533.4       2352     13310800.6
           5 |   2.36e+07    962190.5       1388    14540696.43
           6 |    7756844      456243        686     4570546.85
           7 |   5.53e+07     1789277       4135    38181973.78
          88 |    1203385    243289.5        100      826108.26
          99 |   15765.66    10722.36          3        9954.32
----------------------------------------------------------------------
The subpop option can be combined with the over option. This is handy because an if restriction cannot be used with the over option; by combining the two options, you can have "the best of both worlds." In the example below, our subpopulation includes only white males, and the mean of education is given for each marital status. Notice that for category "88", the mean is absurdly high (83.44). This is because Stata considers 88 to be a valid value, not the missing data code that it is (according to the documentation).
svy, subpop(hssex1 if dmaracer==1): mean hfa8r, over(hfa12)
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs      =     20050
                                    Population size    = 187647206
                                    Replications       =        52
                                    Design df          =        51

        1: hfa12 = 1
        2: hfa12 = 2
        3: hfa12 = 3
        4: hfa12 = 4
        5: hfa12 = 5
        6: hfa12 = 6
        7: hfa12 = 7
       88: hfa12 = 88

--------------------------------------------------------------
             |        BRR *
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
hfa8r        |
           1 |   12.85045    .1013903     12.64691      13.054
           2 |   11.15848    .6515755     9.850389    12.46657
           3 |   11.90575    .2359017     11.43216    12.37934
           4 |    10.8672    .3819635     10.10037    11.63402
           5 |    12.8703    .1584319     12.55224    13.18837
           6 |    12.6802    1.564755      9.53882    15.82157
           7 |   12.72527    .2070723     12.30955    13.14098
          88 |   83.43886    3.637654     76.13596    90.74175
--------------------------------------------------------------
For more information on analyzing subpopulations in Stata, please see our Stata FAQ: How can I analyze a subpopulation of my survey data in Stata 9?
Regression analyses
Let’s look at some regression analyses. Stata has a very nice suite of regression commands that can be used with the svy: prefix. Type help survey for a list of commands that can be used with the svy: prefix.
svy: reg hav6s hav2s hav3s hff4 hfa13 hfe7 hfa8r
(running regress on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Linear regression           Number of obs      =      5866
                                    Population size    =  69288876
                                    Replications       =        52
                                    Design df          =        51
                                    F(   6,     46)    =     83.92
                                    Prob > F           =    0.0000
                                    R-squared          =    0.4285

------------------------------------------------------------------------------
             |               BRR *
       hav6s |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hav2s |   .0335188   .0164996     2.03   0.047     .0003945    .0666432
       hav3s |   .0458665    .016843     2.72   0.009     .0120528    .0796802
        hff4 |   78.15901   47.06818     1.66   0.103    -16.33432    172.6523
       hfa13 |   14.97253   6.777161     2.21   0.032     1.366808    28.57825
        hfe7 |  -9.015854   11.81718    -0.76   0.449    -32.73983    14.70812
       hfa8r |   .6153294   .6913622     0.89   0.378    -.7726382    2.003297
       _cons |  -59.27143   62.78533    -0.94   0.350    -185.3182    66.77539
------------------------------------------------------------------------------
As we see in the example below, we can put i. before categorical predictor variables to have Stata automatically create dummy variables for us. (As of Stata 11, the xi: prefix is no longer needed. If you are using an earlier version of Stata, you should put the xi: prefix before the svy: prefix.)
svy: reg hav2s hfe7 hfa13 hssex i.dmpcregn hah15
(running regress on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Linear regression           Number of obs      =     13398
                                    Population size    = 114679831
                                    Replications       =        52
                                    Design df          =        51
                                    F(   7,     45)    =      4.33
                                    Prob > F           =    0.0010
                                    R-squared          =    0.1045

------------------------------------------------------------------------------
             |               BRR *
       hav2s |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        hfe7 |   1078.949   215.3567     5.01   0.000      646.602    1511.295
       hfa13 |    1212.34   267.1979     4.54   0.000     675.9174    1748.762
       hssex |  -403.5371   128.8811    -3.13   0.003    -662.2767   -144.7975
             |
    dmpcregn |
          2  |   33.12827   72.97049     0.45   0.652    -113.3661    179.6226
          3  |  -19.01354   68.06759    -0.28   0.781    -155.6649    117.6378
          4  |    249.894   122.4217     2.04   0.046     4.122106    495.6659
             |
       hah15 |   935.0711   512.2536     1.83   0.074    -93.32097    1963.463
       _cons |  -5005.517   1185.052    -4.22   0.000    -7384.609   -2626.425
------------------------------------------------------------------------------
No svy: prefix is needed before the test command.
test 2.dmpcregn 3.dmpcregn 4.dmpcregn

Adjusted Wald test

 ( 1)  2.dmpcregn = 0
 ( 2)  3.dmpcregn = 0
 ( 3)  4.dmpcregn = 0

       F(  3,    49) =    1.99
            Prob > F =    0.1277
Let’s create a 0/1 variable and run a logistic regression.
generate clubs = hav5

recode clubs (2 = 0)
(clubs: 14184 changes made)

svy: logit clubs hfe7 hfa13 hssex i.dmpcregn hah15
(running logit on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Logistic regression         Number of obs      =     13398
                                    Population size    = 114679831
                                    Replications       =        52
                                    Design df          =        51
                                    F(   7,     45)    =      7.13
                                    Prob > F           =    0.0000

------------------------------------------------------------------------------
             |               BRR *
       clubs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        hfe7 |   .1212481   .0502015     2.42   0.019     .0204643    .2220319
       hfa13 |   -.392244   .0611189    -6.42   0.000    -.5149452   -.2695427
       hssex |  -.0568606   .0474936    -1.20   0.237     -.152208    .0384869
             |
    dmpcregn |
          2  |   .0246446   .1229863     0.20   0.842    -.2222608    .2715499
          3  |  -.2444328   .1344003    -1.82   0.075    -.5142527    .0253871
          4  |  -.1163411   .1270524    -0.92   0.364    -.3714094    .1387272
             |
       hah15 |    .362997   .0947355     3.83   0.000     .1728077    .5531864
       _cons |  -.5054835    .212936    -2.37   0.021    -.9329703   -.0779967
------------------------------------------------------------------------------

test 2.dmpcregn 3.dmpcregn 4.dmpcregn

Adjusted Wald test

 ( 1)  [clubs]2.dmpcregn = 0
 ( 2)  [clubs]3.dmpcregn = 0
 ( 3)  [clubs]4.dmpcregn = 0

       F(  3,    49) =    2.12
            Prob > F =    0.1096
We don’t have much time to talk about regression diagnostics here, although that is a common question among researchers who use survey data. Most of the assumptions still apply when using survey data, but they can be more difficult to check. As of Stata 11, most of the diagnostic commands that you would use after regress, logistic, etc., don’t work after svy: regress, svy: logit, etc. Some of the assumptions don’t really apply, though, because of the extremely large sample size involved. Checking assumptions when doing a subpopulation analysis can be even trickier.
Using multiply imputed data
Some of the NHANES III data sets were released as imputed data sets. This means that some of the variables contained in the data set were multiply imputed. For information on which variables were imputed, the imputation method, etc., please see http://www.cdc.gov/nchs/nhanes.htm and ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/7A/doc/main.pdf .
If you are using Stata 11 or later, we suggest using the built-in Stata prefix mi. If you are using Stata 10 or earlier, please see the last paragraph in this section for suggestions on how to analyze the multiply imputed datasets.
The commands below are used to change the current working directory (cd), to set the memory (set mem), to use the nh3core data set (use), and to create the single multiply imputed data set (called mymi) from the five imputed data sets provided by NHANES (mi import nhanes1).
cd "D:datanhanes IIImiStata_data" D:Datanhanes IIImiStata_data set mem 200m (204800k) use nh3core mi import nhanes1 mymi, using(nh3mi1.dta nh3mi2.dta nh3mi3.dta nh3mi4.dta nh3mi5.dta) id(seqn) (variables sdppsu6 sdpstra6 wtpfqx6 wtpqrp1 wtpqrp2 wtpqrp3 wtpqrp4 wtpqrp5 wtpqrp6 wtpqrp7 wtpqrp8 wtpqrp9 wtpqrp10 wtpqrp11 wtpqrp12 wtpqrp13 wtpqrp14 wtpqrp15 wtpqrp16 wtpqrp17 wtpqrp18 wtpqrp19 wtpqrp20 wtpqrp21 wtpqrp22 wtpqrp23 wtpqrp24 wtpqrp25 wtpqrp26 wtpqrp27 wtpqrp28 wtpqrp29 wtpqrp30 wtpqrp31 wtpqrp32 wtpqrp33 wtpqrp34 wtpqrp35 wtpqrp36 wtpqrp37 wtpqrp38 wtpqrp39 wtpqrp40 wtpqrp41 wtpqrp42 wtpqrp43 wtpqrp44 wtpqrp45 wtpqrp46 wtpqrp47 wtpqrp48 wtpqrp49 wtpqrp50 wtpqrp51 wtpqrp52 hfa7 hfa8 hfa12 hac1a hac1b hac1c hac1d hac1e hac1f hac1g hac1h hac1i hac1j hac1k hac1l hac1m hac1n hac1o had1 hae2 hae4a hae4b hae5a hae5b hae6 hae7 haf1 haf10 hag2 hag3 hag5a hag5b hag5c hag11 hag12 han6hs han6is han6js hap1 hap1a hap2 hap3 hap10 hap10a har1 har3 har14 har16 har23 har24 har26 har27 hye1g hye1h hye6a hye6b hye15 hyh2 hyh10 dmppirif hff1if hab1if ham5if ham6if han6srif haq1if har3rif hat28if hazak1if hazak5if hazbk1if hazbk5if hazck1if hazck5if hyd1if hyf2if bdpfndif bdpindif bdpkif bdptoaif bdptodif bdptrdif bdpwtdif bmpbutif bmpheaif bmphtif bmpkneif bmprecif bmpsthif bmpsb1if bmpsb2if bmpsp1if bmpsp2if bmptr1if bmptr2if bmpwstif bmpwtif fepif frpif hdpif hgpif htpif mcpsiif mhpif mvpsiif pbpif phpfstif pxpif rcpif rwpif sepif tcpif tgpif tipif fppsudif fppsumif fppsurif pep6g1if pep6g2if pep6g3if pep6h1if pep6h2if pep6h3if pep6i1if pep6i2if pep6i3if hff1mi hab1mi ham5mi ham6mi han6srmi haq1mi har3rmi hat28mi hazak1mi hazak5mi hazbk1mi hazbk5mi hazck1mi hazck5mi hyd1mi hyf2mi bmpsb1mi bmpsb2mi bmpsp1mi bmpsp2mi bmptr1mi bmptr2mi pep6g1mi pep6g2mi pep6g3mi pep6h1mi pep6h2mi pep6h3mi pep6i1mi pep6i2mi pep6i3mi dropped in m=1)< output omitted to save space > (variables sdppsu6 sdpstra6 wtpfqx6 wtpqrp1 wtpqrp2 wtpqrp3 wtpqrp4 wtpqrp5 wtpqrp6 wtpqrp7 wtpqrp8 wtpqrp9 wtpqrp10 wtpqrp11 wtpqrp12 wtpqrp13 wtpqrp14 wtpqrp15 wtpqrp16 wtpqrp17 wtpqrp18 wtpqrp19 wtpqrp20 wtpqrp21 wtpqrp22 wtpqrp23 wtpqrp24 wtpqrp25 wtpqrp26 wtpqrp27 wtpqrp28 wtpqrp29 wtpqrp30 wtpqrp31 wtpqrp32 wtpqrp33 wtpqrp34 wtpqrp35 wtpqrp36 wtpqrp37 wtpqrp38 wtpqrp39 wtpqrp40 wtpqrp41 wtpqrp42 wtpqrp43 wtpqrp44 wtpqrp45 wtpqrp46 wtpqrp47 wtpqrp48 wtpqrp49 wtpqrp50 wtpqrp51 wtpqrp52 hfa7 hfa8 hfa12 hac1a hac1b hac1c hac1d hac1e hac1f hac1g hac1h hac1i hac1j hac1k hac1l hac1m hac1n hac1o had1 hae2 hae4a hae4b hae5a hae5b hae6 hae7 haf1 haf10 hag2 hag3 hag5a hag5b hag5c hag11 hag12 han6hs han6is han6js hap1 hap1a hap2 hap3 hap10 hap10a har1 har3 har14 har16 har23 har24 har26 har27 hye1g hye1h hye6a hye6b hye15 hyh2 hyh10 dmppirif hff1if hab1if ham5if ham6if han6srif haq1if har3rif hat28if hazak1if hazak5if hazbk1if hazbk5if hazck1if hazck5if hyd1if hyf2if bdpfndif bdpindif bdpkif bdptoaif bdptodif bdptrdif bdpwtdif bmpbutif bmpheaif bmphtif bmpkneif bmprecif bmpsthif bmpsb1if bmpsb2if bmpsp1if bmpsp2if bmptr1if bmptr2if bmpwstif bmpwtif fepif frpif hdpif hgpif htpif mcpsiif mhpif mvpsiif pbpif phpfstif pxpif rcpif rwpif sepif tcpif tgpif tipif fppsudif fppsumif fppsurif pep6g1if pep6g2if pep6g3if pep6h1if pep6h2if pep6h3if pep6i1if pep6i2if pep6i3if hff1mi hab1mi ham5mi ham6mi han6srmi haq1mi har3rmi hat28mi hazak1mi hazak5mi hazbk1mi hazbk5mi hazck1mi hazck5mi hyd1mi hyf2mi bmpsb1mi bmpsb2mi bmpsp1mi bmpsp2mi bmptr1mi bmptr2mi pep6g1mi pep6g2mi pep6g3mi pep6h1mi pep6h2mi pep6h3mi pep6i1mi pep6i2mi pep6i3mi dropped in m=5)
The mi describe and mi varying commands provide useful information about the data set. The varcase command is a user-written command (search varcase) that changes the case of variable names. It is used here to change some of the variable names from being in all capital letters to being in all lower case letters.
mi describe

  Style:  flongsep mymi
          last mi update 27sep2010 15:21:48, 0 seconds ago

  Obs.:   complete          20,870
          incomplete        13,124  (M = 5 imputations)
          ---------------------
          total             33,994

  Vars.:  imputed: 36; dmppir(3410) bdpfnd(4179+15169) bdpind(4179+15169)
                   bdpk(4179+15169) bdptoa(4179+15169) bdptod(4179+15169)
                   bdptrd(4179+15169) bdpwtd(4179+15169) bmpbut(4258+3446)
                   bmphea(799+24586) bmpht(2613+3446) bmpkne(1658+27398)
                   bmprec(457+28012) bmpsth(3539+3446) bmpwst(4260+3446)
                   bmpwt(2862) fep(5408+2107) frp(5494+2107) hdp(4603+5982)
                   hgp(5515+2107) htp(5517+2107) mcpsi(5518+2107)
                   mhp(5518+2107) mvpsi(5516+2107) pbp(5069+2107)
                   phpfst(4187+2107) pxp(6117+2107) rcp(5517+2107)
                   rwp(5515+2107) sep(3669+11728) tcp(4451+5982)
                   tgp(4497+5982) tip(6085+2107) fppsud(2962+22546)
                   fppsum(3240+22546) fppsur(3237+22546)

          passive: 0

          regular: 0

          system:  2; _mi_id _mi_miss

         (there are 163 unregistered variables)

mi varying

              Possible problem variable names
--------------------------------------------------------------------------
           imputed nonvarying:  (none)
           passive nonvarying:  (none)
         unregistered varying:  (none)
  *unregistered super/varying:  (none)
   unregistered super varying:  (none)
--------------------------------------------------------------------------
* super/varying means super varying but would be varying if registered as
  imputed; variables vary only where equal to soft missing in m=0.

varcase(SDPPSU6- PEP6I3IF)
Now we are ready to use the svyset command. We include the mi prefix.
mi svyset sdppsu6 [pweight = wtpfqx6], strata(sdpstra6) singleunit(centered)

(variables SDPPSU6 SDPSTRA6 WTPFQX6 WTPQRP1 WTPQRP2 WTPQRP3 WTPQRP4 WTPQRP5 WTPQRP6 WTPQRP7 WTPQRP8 WTPQRP9 WTPQRP10 WTPQRP11 WTPQRP12 WTPQRP13 WTPQRP14 WTPQRP15 WTPQRP16 WTPQRP17 WTPQRP18 WTPQRP19 WTPQRP20 WTPQRP21 WTPQRP22 WTPQRP23 WTPQRP24 WTPQRP25 WTPQRP26 WTPQRP27 WTPQRP28 WTPQRP29 WTPQRP30 WTPQRP31 WTPQRP32 WTPQRP33 WTPQRP34 WTPQRP35 WTPQRP36 WTPQRP37 WTPQRP38 WTPQRP39 WTPQRP40 WTPQRP41 WTPQRP42 WTPQRP43 WTPQRP44 WTPQRP45 WTPQRP46 WTPQRP47 WTPQRP48 WTPQRP49 WTPQRP50 WTPQRP51 WTPQRP52 HFA7 HFA8 HFA12 HAC1A HAC1B HAC1C HAC1D HAC1E HAC1F HAC1G HAC1H HAC1I HAC1J HAC1K HAC1L HAC1M HAC1N HAC1O HAD1 HAE2 HAE4A HAE4B HAE5A HAE5B HAE6 HAE7 HAF1 HAF10 HAG2 HAG3 HAG5A HAG5B HAG5C HAG11 HAG12 HAN6HS HAN6IS HAN6JS HAP1 HAP1A HAP2 HAP3 HAP10 HAP10A HAR1 HAR3 HAR14 HAR16 HAR23 HAR24 HAR26 HAR27 HYE1G HYE1H HYE6A HYE6B HYE15 HYH2 HYH10 HFF1IF HAB1IF HAM5IF HAM6IF HAN6SRIF HAQ1IF HAR3RIF HAT28IF HAZAK1IF HAZAK5IF HAZBK1IF HAZBK5IF HAZCK1IF HAZCK5IF HYD1IF HYF2IF BMPSB1IF BMPSB2IF BMPSP1IF BMPSP2IF BMPTR1IF BMPTR2IF PEP6G1IF PEP6G2IF PEP6G3IF PEP6H1IF PEP6H2IF PEP6H3IF PEP6I1IF PEP6I2IF PEP6I3IF dropped in m=1)

(variables sdppsu6 sdpstra6 wtpfqx6 wtpqrp1 wtpqrp2 wtpqrp3 wtpqrp4 wtpqrp5 wtpqrp6 wtpqrp7 wtpqrp8 wtpqrp9 wtpqrp10 wtpqrp11 wtpqrp12 wtpqrp13 wtpqrp14 wtpqrp15 wtpqrp16 wtpqrp17 wtpqrp18 wtpqrp19 wtpqrp20 wtpqrp21 wtpqrp22 wtpqrp23 wtpqrp24 wtpqrp25 wtpqrp26 wtpqrp27 wtpqrp28 wtpqrp29 wtpqrp30 wtpqrp31 wtpqrp32 wtpqrp33 wtpqrp34 wtpqrp35 wtpqrp36 wtpqrp37 wtpqrp38 wtpqrp39 wtpqrp40 wtpqrp41 wtpqrp42 wtpqrp43 wtpqrp44 wtpqrp45 wtpqrp46 wtpqrp47 wtpqrp48 wtpqrp49 wtpqrp50 wtpqrp51 wtpqrp52 hfa7 hfa8 hfa12 hac1a hac1b hac1c hac1d hac1e hac1f hac1g hac1h hac1i hac1j hac1k hac1l hac1m hac1n hac1o had1 hae2 hae4a hae4b hae5a hae5b hae6 hae7 haf1 haf10 hag2 hag3 hag5a hag5b hag5c hag11 hag12 han6hs han6is han6js hap1 hap1a hap2 hap3 hap10 hap10a har1 har3 har14 har16 har23 har24 har26 har27 hye1g hye1h hye6a hye6b hye15 hyh2 hyh10 hff1if hab1if ham5if ham6if han6srif haq1if har3rif hat28if hazak1if hazak5if hazbk1if hazbk5if hazck1if hazck5if hyd1if hyf2if bmpsb1if bmpsb2if bmpsp1if bmpsp2if bmptr1if bmptr2if pep6g1if pep6g2if pep6g3if pep6h1if pep6h2if pep6h3if pep6i1if pep6i2if pep6i3if added to m=1)

< output omitted to save space >

(variables SDPPSU6 SDPSTRA6 WTPFQX6 WTPQRP1 WTPQRP2 WTPQRP3 WTPQRP4 WTPQRP5 WTPQRP6 WTPQRP7 WTPQRP8 WTPQRP9 WTPQRP10 WTPQRP11 WTPQRP12 WTPQRP13 WTPQRP14 WTPQRP15 WTPQRP16 WTPQRP17 WTPQRP18 WTPQRP19 WTPQRP20 WTPQRP21 WTPQRP22 WTPQRP23 WTPQRP24 WTPQRP25 WTPQRP26 WTPQRP27 WTPQRP28 WTPQRP29 WTPQRP30 WTPQRP31 WTPQRP32 WTPQRP33 WTPQRP34 WTPQRP35 WTPQRP36 WTPQRP37 WTPQRP38 WTPQRP39 WTPQRP40 WTPQRP41 WTPQRP42 WTPQRP43 WTPQRP44 WTPQRP45 WTPQRP46 WTPQRP47 WTPQRP48 WTPQRP49 WTPQRP50 WTPQRP51 WTPQRP52 HFA7 HFA8 HFA12 HAC1A HAC1B HAC1C HAC1D HAC1E HAC1F HAC1G HAC1H HAC1I HAC1J HAC1K HAC1L HAC1M HAC1N HAC1O HAD1 HAE2 HAE4A HAE4B HAE5A HAE5B HAE6 HAE7 HAF1 HAF10 HAG2 HAG3 HAG5A HAG5B HAG5C HAG11 HAG12 HAN6HS HAN6IS HAN6JS HAP1 HAP1A HAP2 HAP3 HAP10 HAP10A HAR1 HAR3 HAR14 HAR16 HAR23 HAR24 HAR26 HAR27 HYE1G HYE1H HYE6A HYE6B HYE15 HYH2 HYH10 HFF1IF HAB1IF HAM5IF HAM6IF HAN6SRIF HAQ1IF HAR3RIF HAT28IF HAZAK1IF HAZAK5IF HAZBK1IF HAZBK5IF HAZCK1IF HAZCK5IF HYD1IF HYF2IF BMPSB1IF BMPSB2IF BMPSP1IF BMPSP2IF BMPTR1IF BMPTR2IF PEP6G1IF PEP6G2IF PEP6G3IF PEP6H1IF PEP6H2IF PEP6H3IF PEP6I1IF PEP6I2IF PEP6I3IF dropped in m=5)

(variables sdppsu6 sdpstra6 wtpfqx6 wtpqrp1 wtpqrp2 wtpqrp3 wtpqrp4 wtpqrp5 wtpqrp6 wtpqrp7 wtpqrp8 wtpqrp9 wtpqrp10 wtpqrp11 wtpqrp12 wtpqrp13 wtpqrp14 wtpqrp15 wtpqrp16 wtpqrp17 wtpqrp18 wtpqrp19 wtpqrp20 wtpqrp21 wtpqrp22 wtpqrp23 wtpqrp24 wtpqrp25 wtpqrp26 wtpqrp27 wtpqrp28 wtpqrp29 wtpqrp30 wtpqrp31 wtpqrp32 wtpqrp33 wtpqrp34 wtpqrp35 wtpqrp36 wtpqrp37 wtpqrp38 wtpqrp39 wtpqrp40 wtpqrp41 wtpqrp42 wtpqrp43 wtpqrp44 wtpqrp45 wtpqrp46 wtpqrp47 wtpqrp48 wtpqrp49 wtpqrp50 wtpqrp51 wtpqrp52 hfa7 hfa8 hfa12 hac1a hac1b hac1c hac1d hac1e hac1f hac1g hac1h hac1i hac1j hac1k hac1l hac1m hac1n hac1o had1 hae2 hae4a hae4b hae5a hae5b hae6 hae7 haf1 haf10 hag2 hag3 hag5a hag5b hag5c hag11 hag12 han6hs han6is han6js hap1 hap1a hap2 hap3 hap10 hap10a har1 har3 har14 har16 har23 har24 har26 har27 hye1g hye1h hye6a hye6b hye15 hyh2 hyh10 hff1if hab1if ham5if ham6if han6srif haq1if har3rif hat28if hazak1if hazak5if hazbk1if hazbk5if hazck1if hazck5if hyd1if hyf2if bmpsb1if bmpsb2if bmpsp1if bmpsp2if bmptr1if bmptr2if pep6g1if pep6g2if pep6g3if pep6h1if pep6h2if pep6h3if pep6i1if pep6i2if pep6i3if added to m=5)

      pweight: wtpfqx6
          VCE: linearized
  Single unit: centered
     Strata 1: sdpstra6
         SU 1: sdppsu6
        FPC 1: <zero>
At long last, we are ready to do our analysis. Please note that not all of the commands that are available with the svy: prefix are available with the mi estimate: svy: prefix. For the examples here, we will get the mean of three variables (bmpwst, bmpht and bmpbut), and then we will run a regression.
mi estimate: svy: mean bmpwst bmpht bmpbut

Multiple-imputation estimates       Imputations        =         5
Survey: Mean estimation             Number of obs      =     30548
Number of strata =        49        Population size    = 243785748
Number of PSUs   =        98        Average RVI        =    0.0242
                                    Complete DF        =        49
DF adjustment:   Small sample       DF:     min        =     46.64
                                            avg        =     46.67
Within VCE type:   Linearized               max        =     46.72

------------------------------------------------------------------------------
        Mean |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      bmpwst |   84.72488   .2081347   407.07   0.000     84.30608    85.14368
       bmpht |   160.6015   .1916889   837.82   0.000     160.2158    160.9871
      bmpbut |   94.10488   .1790256   525.65   0.000     93.74465     94.4651
------------------------------------------------------------------------------

mi estimate: svy: regress bmpwst bmpht bmpbut

Multiple-imputation estimates       Imputations        =         5
Survey: Linear regression           Number of obs      =     30548
Number of strata =        49        Population size    = 243785748
Number of PSUs   =        98        Average RVI        =    0.3866
                                    Complete DF        =        49
DF adjustment:   Small sample       DF:     min        =      8.66
                                            avg        =     20.78
                                            max        =     43.77
Model F test:       Equal FMI       F(   2,   13.5)    =  13019.57
Within VCE type:   Linearized       Prob > F           =    0.0000

------------------------------------------------------------------------------
      bmpwst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       bmpht |  -.0063002   .0129422    -0.49   0.637    -.0351702    .0225698
      bmpbut |   .9663688   .0200984    48.08   0.000     .9206272     1.01211
       _cons |  -5.203297   .5062292   -10.28   0.000    -6.223686   -4.182908
------------------------------------------------------------------------------
The preceding is intended only to give a cursory overview of the steps involved in analyzing multiply imputed survey data sets. By no means have we covered all (or even most) of the issues that can be involved. Rather, the point is that such data can be analyzed in Stata 11, and a few of the commands to do so have been illustrated. Reading over the relevant sections of the Stata 11 manuals will be very helpful and is strongly recommended.
If you are using Stata 10 or earlier, we suggest using the prefix mim: for analyzing multiply imputed data sets, although there are some other prefixes available for Stata. The prefix mim: is not part of official Stata and needs to be downloaded (search mim). The NHANES data were released with five imputed data sets. Unlike SUDAAN, mim wants the data sets stacked into a single data set. We can use the mimstack command to do this (search mimstack). We need to specify the unique case identifier in each data set (seqn) and the "stub" name (nh3mi) of the imputed data sets. We strongly encourage everyone to read the help file for mim before using the mim: prefix or the mimstack command, and to check whether the procedure that you want to use is supported by mim; for example, svy: tab is not supported. You may also want to check periodically to see if there are any updates to mim or similar procedures. We should also note that mimstack can take some time to run, especially if you don’t have a lot of RAM on your computer. A minimal sketch of this workflow follows.
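This sketch assumes the five imputed data sets are named nh3mi1.dta through nh3mi5.dta and sit in the working directory; consult the mim and mimstack help files for the exact options available in your installed version:

* Stata 10 or earlier: stack the five imputed data sets into one
mimstack, m(5) sortorder(seqn) istub(nh3mi) clear
* declare the survey design, then prefix each analysis with mim:
svyset sdppsu6 [pweight = wtpfqx6], strata(sdpstra6) singleunit(centered)
mim: svy: mean bmpwst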
A quick note on merging data sets
The NHANES documentation provides clear instructions on how to merge various data sets from the NHANES III collection. You need to read that very carefully. Depending on what data sets you merge, you may need to adjust the sampling weights. Again, when and how to do this is spelled out in the documentation, so please read that carefully. Also, be aware that you will likely have to modify your svyset command to work with the merged data set. In fact, the svyset command may need to be different for different models that you run from the merged data set, depending on which variables are included in the model.
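As a generic illustration only (not NHANES-specific advice), a match-merge on the shared respondent identifier seqn might look like this in Stata 11 or later; nhanes_lab1 is a made-up file name, and any weight adjustments required by the documentation would still need to be done afterward:

* hypothetical match-merge of two NHANES III files on the respondent id seqn
use "D:\data\working\nhanes_adult1", clear
merge 1:1 seqn using nhanes_lab1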
Another point to remember is that what you do to merge data sets or modify sampling weights with the NHANES data will likely NOT generalize to other survey data sets. You will need to read the documentation for each data set and will probably have to follow different procedures to accomplish the same tasks with other data sets. In other words, many ways of doing things are specific to a particular data set; be careful not to overgeneralize.
For more information on using the NHANES data sets
There are several helpful resources for learning how to analyze the NHANES III data sets correctly. One is a listserv at http://www.cdc.gov/nchs/nhanes/nhanes_listserv.htm . There are also online tutorials at http://www.cdc.gov/nchs/tutorials/index.htm .
References
The Stata Journal, Vol. 7, No. 1, 1st Quarter 2007
Analysis of Health Surveys by Edward L. Korn and Barry I. Graubard
Sampling of Populations: Methods and Applications, Third Edition by Paul Levy and Stanley Lemeshow
Analysis of Survey Data Edited by R. L. Chambers and C. J. Skinner
Sampling Techniques, Third Edition by William G. Cochran
Stata 9 Manual: Survey Data
Analyzing the NHANES III Multiply Imputed Data Set: Methods and Examples by Joseph L. Schafer (June, 2001)