Applied Survey Data Analysis in Stata 13

The purpose of this workshop is to explore some issues in the analysis of survey data using Stata 13. Before we begin, you will want to be sure that your copy of Stata is up-to-date. To do this, please type

update all

in the Stata command window and follow any instructions given. These updates not only fix known bugs but also add new features that may be useful. I am using Stata 13.1.

Before we begin looking at examples in Stata, we will quickly review some basic issues and concepts in survey data analysis.

NOTE: Most of the commands in this seminar will work with Stata 12.

Why do we need survey data analysis software?

Regular statistical software (that is not designed for survey data) analyzes data as if the data were collected using simple random sampling. For experimental and quasi-experimental designs, this is exactly what we want. However, very few surveys use a simple random sample to collect data. Not only is it nearly impossible to do so, but it is not as efficient (either financially or statistically) as other sampling methods. When any sampling method other than simple random sampling is used, we usually need to use survey data analysis software to take into account the differences between the design that was used to collect the data and simple random sampling. This is because the sampling design affects both the calculation of the point estimates and the standard errors of those estimates. If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, both the point estimates and their standard errors will likely be calculated incorrectly. The sampling weight will affect the calculation of the point estimate, and the stratification and/or clustering will affect the calculation of the standard errors. Ignoring the clustering will likely lead to standard errors that are underestimated, possibly leading to results that seem to be statistically significant when, in fact, they are not. The difference in point estimates and standard errors obtained using non-survey software and survey software with the design properly specified will vary from data set to data set, and even between analyses using the same data set. While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be.

Sampling designs

Most people do not conduct their own surveys. Rather, they use survey data that some agency or company collected and made available to the public. The documentation must be read carefully to find out what kind of sampling design was used to collect the data. This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs. Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.

Below are some common features of many sampling designs.

Sampling weights: There are several types of weights that can be associated with a survey. Perhaps the most common is the sampling weight. A sampling weight is a probability weight that has had one or more adjustments made to it. Both a sampling weight and a probability weight are used to weight the sample back to the population from which the sample was drawn. By definition, a probability weight is the inverse of the probability of being included in the sample due to the sampling design (except for a certainty PSU, see below). The probability weight, called a pweight in Stata, is calculated as N/n, where N = the number of elements in the population and n = the number of elements in the sample. For example, if a population has 10 elements and 3 are sampled at random with replacement, then the probability weight would be 10/3 = 3.33. In a two-stage design, the probability weight is calculated as (1/f₁)(1/f₂), where f₁ and f₂ are the sampling fractions at the first and second stages; that is, the inverse of the sampling fraction for the first stage is multiplied by the inverse of the sampling fraction for the second stage. Under many sampling plans, the sum of the probability weights will equal the population total.
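The arithmetic behind these weights can be sketched with a few display commands. The single-stage numbers are the ones from the example above; the two-stage sampling fractions (5 of 50 PSUs, then 20 of 400 elements within each PSU) are hypothetical values chosen only for illustration.

```stata
* Sketch: probability weights as inverse sampling fractions.

* Single stage: N = 10 population elements, n = 3 sampled.
display "single-stage pweight = " %5.2f 10/3

* Two stages (hypothetical fractions, for illustration only):
* stage 1 samples 5 of 50 PSUs, stage 2 samples 20 of 400 elements per PSU.
local f1 = 5/50
local f2 = 20/400
display "two-stage pweight = " (1/`f1') * (1/`f2')
```

In practice you would rarely compute these yourself; the weight supplied with a public-use data set already reflects the design and any later adjustments.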

While many textbooks will end their discussion of probability weights here, this definition does not fully describe the sampling weights that are included with actual survey data sets. Rather, the sampling weight, which is sometimes called a "final weight," starts with the inverse of the sampling fraction, but then incorporates several other values, such as corrections for unit non-response, errors in the sampling frame (sometimes called non-coverage), and poststratification. Because these other values are included in the sampling weight that is provided with the data set, it is often inadvisable to modify the sampling weights, such as trying to standardize them for a particular variable, e.g., age.

PSU: This is the primary sampling unit. This is the first unit that is sampled in the design. For example, school districts from California may be sampled and then schools within districts may be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU. One does not need to use the same sampling method at all levels of sampling. For example, probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts). In the case of a simple random sample, the PSUs and the elementary units are the same. In general, accounting for the clustering in the data (i.e., using the PSUs) will increase the standard errors of the point estimates. Conversely, ignoring the PSUs will tend to yield standard errors that are too small, leading to false positives when doing significance tests.

Strata: Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race or SES. Each element in the population must belong to one, and only one, stratum. Once the strata have been defined, samples are taken from each stratum as if it were independent of all of the other strata. For example, if a sample is to be stratified on gender, men and women would be sampled independently of one another. This means that the probability weights for men will likely be different from the probability weights for the women. In most cases, you need to have two or more PSUs in each stratum. The purpose of stratification is to reduce the standard error of the estimates, and stratification works most effectively when the variance of the dependent variable is smaller within the strata than in the sample as a whole.

FPC: This is the finite population correction. This is used when the sampling fraction (the number of elements or respondents sampled relative to the population) becomes large. The FPC is used in the calculation of the standard error of the estimate. If the value of the FPC is close to 1, it will have little impact and can be safely ignored. In some survey data analysis programs, such as SUDAAN, this information will be needed if you specify that the data were collected without replacement (see below for a definition of "without replacement"). The formula for calculating the FPC is ((N-n)/(N-1))^(1/2), where N is the number of elements in the population and n is the number of elements in the sample. To see the impact of the FPC for samples of various proportions, suppose that you had a population of 10,000 elements.

Sample size (n)    FPC
1                  1.0000
10                  .9995
100                 .9950
500                 .9747
1000                .9487
5000                .7071
9000                .3162
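The table above can be reproduced directly from the formula. The following is a minimal sketch that evaluates ((N-n)/(N-1))^(1/2) for N = 10,000 and the same sample sizes; the local macro name N is just a convenience for this illustration.

```stata
* Sketch: FPC = sqrt((N - n)/(N - 1)) for a population of N = 10,000.
local N = 10000
foreach n of numlist 1 10 100 500 1000 5000 9000 {
    display %6.0f `n' "   " %6.4f sqrt((`N' - `n')/(`N' - 1))
}
```

Changing the value of N shows how quickly the FPC approaches 1 when the population is large relative to the sample.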

Replicate weights: Replicate weights are a series of weight variables that are used to correct the standard errors for the sampling plan. They serve the same function as the PSU and strata variables (which are used with Taylor series linearization) to correct the standard errors of the estimates for the sampling design. Many public use data sets are now being released with replicate weights instead of PSUs and strata in an effort to more securely protect the identity of the respondents. In theory, the same standard errors will be obtained using either the PSU and strata or the replicate weights. There are different ways of creating replicate weights; the method used is determined by the sampling plan. The most common are balanced repeated replication (BRR) and jackknife replicate weights. You will need to read the documentation for the survey data set carefully to learn what type of replicate weight is included in the data set; specifying the wrong type of replicate weight will likely lead to incorrect standard errors. For more information on replicate weights, please see Stata Library: Replicate Weights and Appendix D of the WesVar Manual by Westat, Inc. Several statistical packages, including Stata, SAS, SUDAAN, WesVar and R, allow the use of replicate weights.
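As a rough sketch of how replicate weights are declared in Stata, the svyset command accepts the replicate-weight variables in place of the PSU and strata variables. The variable names below (finalwt, brrwt1-brrwt80, jkwt1-jkwt80) are hypothetical; the NHANES data used later in this workshop are not released with replicate weights, so these lines will run only on a data set that actually contains such variables.

```stata
* Sketch only: finalwt, brrwt1-brrwt80 and jkwt1-jkwt80 are hypothetical
* variable names; use the names and the method given in the survey documentation.

* Balanced repeated replication (BRR) replicate weights
svyset [pweight = finalwt], brrweight(brrwt1-brrwt80) vce(brr)

* Jackknife replicate weights
svyset [pweight = finalwt], jkrweight(jkwt1-jkwt80) vce(jackknife)
```

Some surveys require additional settings (for example, an adjustment factor for the replicates), so the documentation for the data set should be checked before relying on either form.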

Consequences of not using the design elements

Sampling design elements include the sampling weights, post-stratification weights (if provided), PSUs, strata, and replicate weights. Rarely are all of these elements included in a single public-use data set. However, ignoring the design elements that are included can often lead to inaccurate point estimates and/or inaccurate standard errors.

Sampling with and without replacement

Most samples collected in the real world are collected "without replacement". This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample. Many of the calculations change depending on whether a sample is collected with or without replacement. Hence, programs like SUDAAN request that you specify whether a survey sampling design was implemented with or without replacement, and an FPC is used if sampling without replacement is used, even if the value of the FPC is very close to one.

Examples

For the examples in this workshop, we will use the data set from NHANES 2011-2012. The data set and documentation can be downloaded from the NHANES web site. The data files can be downloaded as SAS .xpt (transport) files. Files in this format can be read directly into Stata using the fdause command.

Reading the documentation

The first step in analyzing any survey data set is to read the documentation. With many of the public use data sets, the documentation can be quite extensive and sometimes even intimidating. Instead of trying to read the documentation "cover to cover", there are some parts you will want to focus on. First, read the Introduction. This is usually an "easy read" and will orient you to the survey. There is usually a section or chapter called something like "Sample Design and Analysis Guidelines", "Variance Estimation", etc. This is the part that tells you about the design elements included with the survey and how to use them. Some even give example code. If multiple sampling weights have been included in the data set, there will be some instruction about when to use which one. If there is a section or chapter on missing data or imputation, please read that. This will tell you how missing data were handled. You should also read any documentation regarding the specific variables that you intend to use. As we will see a little later on, we will need to look at the documentation to get the value labels for the variables. This is especially important because some of the values are actually missing data codes, and you need to do something so that Stata doesn’t treat those as valid values (or you will get some very "interesting" means, totals, etc.).

Getting the data into Stata

The following commands can be used to open the NHANES data in Stata and save them in Stata format. I have also sorted the data before saving them (because I will merge the files), but this is not technically necessary.

* demographics
clear
fdause "D:\Data\Seminars\Applied Survey Stata 13\demo_g.xpt"
sort seqn
save "D:\Data\Seminars\Applied Survey Stata 13\demo_g", replace

The do-file that imports the data, merges the files and recodes the variables can be found here. More variables are recoded in the do-file than are used in this presentation.
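For readers who want a feel for the import-and-merge step without opening the do-file, here is a minimal sketch. It uses import sasxport, the name that replaced fdause in Stata 12 (fdause continues to work). The second file, paq_g.xpt, and the relative paths are assumptions made for illustration; they are not taken from the do-file, which may organize the steps differently.

```stata
* Sketch only: file locations and the second file name are assumptions.

* demographics
clear
import sasxport "demo_g.xpt"
sort seqn
save "demo_g", replace

* a second NHANES file (hypothetical choice), merged on the respondent id seqn
clear
import sasxport "paq_g.xpt"
merge 1:1 seqn using "demo_g"

* check how many observations matched
tabulate _merge
```

The merge here is 1:1 because each respondent (seqn) is assumed to appear once in each file; tabulating _merge is a quick way to confirm that assumption.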

The variables

We will use about a dozen different variables in the examples in this workshop. Below is a brief summary of them. Some of the variables have been recoded to be binary variables (values of 2 recoded to a value of 0). The count of missing observations includes values truly missing as well as refused and don’t know.

ridageyr – Age in years at exam – recoded; range of values: 0 – 79 are actual values, 80 = 80+ years of age

pad630 – How much time do you spend doing moderate-intensity activities on a typical work day?; range of values: 10-960; 7053 missing observations

hsq496 – During the past 30 days, for about how many days have you felt worried, tense or anxious?; range of values: 0-30; 3073 missing observations

female – Recode of the variable riagendr; 0 = male, 1 = female; no missing observations

dmdborn4 – Country of birth; 1 = born in the United States, 0 = otherwise; 5 missing observations

dmdmartl – Marital status; 1 = married, 2 = widowed, 3 = divorced, 4 = separated, 5 = never married, 6 = living with partner; 4203 missing observations

dmdeduc2 – Education level of adults aged 20+ years; 1 = less than 9th grade, 2 = 9-11th grade, 3 = high school graduate, GED or equivalent, 4 = some college or AA degree, 5 = college graduate or above; 4201 missing observations

pad675 – How much time do you spend doing moderate-intensity sports, fitness, or recreation activities on a typical day?; range of values: 10-600; 6220 missing observations

hsq571 – During the past 12 months, have you donated blood?; 0 = no, 1 = yes; 3673 missing observations

pad680 – How much time do you usually spend sitting on a typical day?; range of values: 0-1380; 2365 missing observations

paq665 – Do you do any moderate-intensity sports, fitness or recreational activities that cause a small increase in breathing or heart rate at least 10 minutes continually?; 0 = no, 1 = yes; 2329 missing observations

hsd010 – Would you say that your general health is…; 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor; 3064 missing observations
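The recoding described above can be sketched as follows for the raw (unrecoded) NHANES variables. This is an illustration rather than the code from the do-file: it assumes that 7 means "refused" and 9 means "don't know" for hsq571, and that riagendr is coded 1 = male, 2 = female, all of which should be confirmed in the NHANES codebook. The label name yesno is hypothetical.

```stata
* Sketch only: run on the raw variables, before any recoding has been done.
* Assumption: 7 = refused and 9 = don't know for hsq571 (check the codebook).
recode hsq571 (2 = 0) (7 9 = .)
label define yesno 0 "no" 1 "yes"
label values hsq571 yesno

* Assumption: riagendr is coded 1 = male, 2 = female.
recode riagendr (1 = 0) (2 = 1), generate(female)
tabulate female, missing
```

Variables with a wider range of valid values typically use longer refused and don't know codes, so each variable's codes need to be checked individually rather than assumed.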

The svyset command

Before we can start our analyses, we need to issue the svyset command. The svyset command tells Stata about the design elements in the survey. Once this command has been issued, all you need to do for your analyses is use the svy: prefix before each command. Because the 2011-2012 NHANES data were released with a sampling weight (wtint2yr), a PSU variable (sdmvpsu) and a strata variable (sdmvstra), we will use these in our svyset command. The svyset command looks like this:

use "D:\Data\Seminars\Applied Survey Stata 13\nhanes2012_merged", clear
svyset sdmvpsu [pw = wtint2yr], strata(sdmvstra) singleunit(centered)

      pweight: wtint2yr
          VCE: linearized
  Single unit: centered
     Strata 1: sdmvstra
         SU 1: sdmvpsu
        FPC 1: <zero>

The singleunit option was added in Stata 10. This option allows for different ways of handling a single PSU in a stratum. If you use the default option, missing, then you will get no standard errors when Stata encounters a single PSU in a stratum. This can happen as a result of missing data or running an analysis on a subpopulation. There are three other options. One is certainty, meaning that the singleton PSUs are treated as certainty PSUs; certainty PSUs are PSUs that were selected into the sample with a probability of 1 (in other words, these PSUs were certain to be in the sample) and do not contribute to the standard error. The scaled option gives a scaled version of the certainty option. The scaling factor comes from using the average of the variances from the strata with multiple sampling units for each stratum with one PSU. The centered option centers strata with one sampling unit at the grand mean instead of the stratum mean.

Now that we have issued the svyset command, we can use the svydescribe command to get information on the strata and PSUs.

svydescribe

Survey: Describing stage 1 sampling units

      pweight: wtint2yr
          VCE: linearized
  Single unit: centered
     Strata 1: sdmvstra
         SU 1: sdmvpsu
        FPC 1: <zero>

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max   
--------  --------  --------  --------  --------  --------
      90         3       862       233     287.3       351
      91         3       998       309     332.7       356
      92         3       875       244     291.7       328
      93         2       602       276     301.0       326
      94         2       688       322     344.0       366
      95         2       722       348     361.0       374
      96         2       676       336     338.0       340
      97         2       608       292     304.0       316
      98         2       708       320     354.0       388
      99         2       682       320     341.0       362
     100         2       700       343     350.0       357
     101         2       715       357     357.5       358
     102         2       624       270     312.0       354
     103         2       296       140     148.0       156
--------  --------  --------  --------  --------  --------
      14        31      9756       140     314.7       388

The output above tells us that there are 14 strata with two or three PSUs in each. There are 31 PSUs. The number of observations differs across PSUs and across strata. There are a minimum of 140 observations, a maximum of 388 observations and an average of 315 observations in the PSUs.
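One quick check after svyset is to add up the sampling weights. As noted in the discussion of sampling weights, under many sampling plans the weights sum to the population total, so the sum should be close to the Population size that the svy commands report. A minimal sketch, assuming the merged data set is in memory:

```stata
* Sketch: sum the sampling weight over all observations in the data set.
quietly summarize wtint2yr
display %15.0fc r(sum)
```

A sum that is far from the documented population total is usually a sign that the wrong weight variable was chosen or that observations were dropped.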

Descriptive statistics

We will start by calculating some descriptive statistics of some of the continuous variables. We can use the svy: mean command to get the mean of continuous variables, and we can follow that command with the estat sd command to get the standard deviation or variance of the variable.

svy: mean ridageyr
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       9756
Number of PSUs   =      31       Population size  =  306590681
                                 Design df        =         17

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    ridageyr |   37.18519   .6964767      35.71576    38.65463
--------------------------------------------------------------

estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
    ridageyr |   37.18519    22.36971
-------------------------------------

estat sd, var

-------------------------------------
             |       Mean    Variance
-------------+-----------------------
    ridageyr |   37.18519    500.4039
-------------------------------------

svy: mean pad630
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14        Number of obs    =      2054
Number of PSUs   =      31        Population size  =  88768571
                                  Design df        =        17

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      pad630 |   139.8874    5.57906      128.1166    151.6582
--------------------------------------------------------------

svy: mean hsq496
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       5883
Number of PSUs   =      31       Population size  =  225667462
                                 Design df        =         17

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      hsq496 |   5.383908   .1899507      4.983147    5.784669
--------------------------------------------------------------

Notice that a different number of observations were used in each analysis. This is important, because if you use these three variables in the same call to svy: mean, you will get different estimates of the means for each of the variables. This is because both pad630 and hsq496 have missing values. However, these variables are not missing on the same observations: Both pad630 and hsq496 are missing on the same 3666 observations, but pad630 has missing values on an additional 4036 observations, and hsq496 has missing values on an additional 207 observations. There are only 1847 observations that have valid values for all three variables, and those are used in the calculations of the means when all of the variables are used in the same call to svy: mean.

svy: mean ridageyr pad630 hsq496
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14        Number of obs    =      1847
Number of PSUs   =      31        Population size  =  81681482
                                  Design df        =        17

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    ridageyr |   41.97793   .8211706      40.24541    43.71045
      pad630 |   139.7228   5.713696       127.668    151.7777
      hsq496 |    5.37395   .2781225      4.787162    5.960737
--------------------------------------------------------------
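The overlap in missing values described above can be inspected before running the analysis. The sketch below assumes the merged data set is in memory; misstable patterns tabulates which combinations of the listed variables are missing, and the count command reports how many observations are complete on all three variables, which should agree with the Number of obs shown in the output above.

```stata
* Sketch: examine the missing-value patterns for the two variables with missing data
misstable patterns pad630 hsq496

* Count the observations with valid values on all three variables
count if !missing(ridageyr, pad630, hsq496)
```

If the complete-case count is much smaller than expected, it may be better to estimate the means in separate calls to svy: mean, as was done earlier.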

Let’s get some descriptive statistics on some of the binary variables, such as female and dmdborn4.

svy: mean female
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       9756
Number of PSUs   =      31       Population size  =  306590681
                                 Design df        =         17

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      female |   .5119524   .0064398      .4983657    .5255392
--------------------------------------------------------------

estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
      female |   .5119524    .4998827
-------------------------------------

svy: mean dmdborn4
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       9751
Number of PSUs   =      31       Population size  =  306485936
                                 Design df        =         17

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    dmdborn4 |   .8499362   .0157561      .8166938    .8831787
--------------------------------------------------------------

estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
    dmdborn4 |   .8499362    .3571522
-------------------------------------

Taking the mean of a variable that is coded 0/1 gives the proportion of 1s. Because the estimates are weighted, the output above indicates that an estimated 51.2% of the population is female and that an estimated 85.0% of the population was born in the United States.

Of course, the svy: tab command can also be used with binary variables. The proportion of .512 matches the .5119524 that we found with the svy: mean command.

svy: tab female
(running tabulate on estimation sample)

Number of strata   =        14                 Number of obs      =       9756
Number of PSUs     =        31                 Population size    =  306590681
                                               Design df          =         17

-----------------------
RECODE of |
riagendr  |
(Gender)  | proportions
----------+------------
     male |        .488
   female |        .512
          | 
    Total |           1
-----------------------
  Key:  proportions  =  cell proportions

By default, svy: tab gives proportions. If you would prefer to see the actual counts, you will need to use the count option. Oftentimes, the counts are too large for the display space in the table, so other options, such as cellwidth and format, need to be used to display the counts. The missing option is often useful if the variable has any missing values (the variable female does not).

svy: tab female, missing count cellwidth(15) format(%15.2g)
(running tabulate on estimation sample)

Number of strata   =        14                 Number of obs      =       9756
Number of PSUs     =        31                 Population size    =  306590681
                                               Design df          =         17

---------------------------
RECODE of |
riagendr  |
(Gender)  |           count
----------+----------------
     male |       149630839
   female |       156959842
          | 
    Total |       306590681
---------------------------
  Key:  count            =  weighted counts

svy: tab female dmdborn4, missing count cellwidth(15) format(%15.2g)
(running tabulate on estimation sample)

Number of strata   =        14                 Number of obs      =       9756
Number of PSUs     =        31                 Population size    =  306590681
                                               Design df          =         17

------------------------------------------------------------------------------
RECODE of |
riagendr  |                     Country of birth - recode                     
(Gender)  |  born elsewhere       born in US                .            Total
----------+-------------------------------------------------------------------
     male |        22449131        127102676            79032        149630839
   female |        23543299        133390830            25713        156959842
          | 
    Total |        45992430        260493506           104745        306590681
------------------------------------------------------------------------------
  Key:  weighted counts

  Pearson:
    Uncorrected   chi2(2)         =    0.9477
    Design-based  F(1.78, 30.34)  =    0.5244     P = 0.5769

Using the col option with svy: tab will give the column proportions. As you can see, the values in the column "Total" are the same as those from svy: tab female.

svy: tab female dmdborn4, col 
(running tabulate on estimation sample)

Number of strata   =        14                 Number of obs      =       9751
Number of PSUs     =        31                 Population size    =  306485936
                                               Design df          =         17

----------------------------------------
RECODE of |
riagendr  |  Country of birth - recode  
(Gender)  | born els   born in     Total
----------+-----------------------------
     male |    .4881     .4879      .488
   female |    .5119     .5121      .512
          | 
    Total |        1         1         1
----------------------------------------
  Key:  column proportions

  Pearson:
    Uncorrected   chi2(1)         =    0.0002
    Design-based  F(1, 17)        =    0.0002     P = 0.9891

Of course, the svy: proportion command can also be used to get proportions.

svy: proportion female
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata =      14       Number of obs    =       9756
Number of PSUs   =      31       Population size  =  306590681
                                 Design df        =         17

--------------------------------------------------------------
             |             Linearized
             | Proportion   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
female       |
        male |   .4880476   .0064398       .474473    .5016398
      female |   .5119524   .0064398      .4983602     .525527
--------------------------------------------------------------

Using the estat command after the svy: mean command also allows you to get the design effects, misspecification effects, unweighted and weighted sample sizes, or the coefficient of variation. The Deff and the Deft are types of design effects, which tell you about the efficiency of your sample. The Deff is a ratio of two variances. In the numerator we have the variance estimate from the current sample (including all of its design elements), and in the denominator we have the variance from a hypothetical sample of the same size drawn as an SRS. In other words, the Deff tells you how efficient your sample is compared to an SRS of equal size. If the Deff is less than 1, your sample is more efficient than SRS; usually the Deff is greater than 1. The Deft is the ratio of two standard error estimates. Again, the numerator is the standard error estimate from the current sample. The denominator is a hypothetical SRS (with replacement) standard error from a sample of the same size as the current sample. You can also use the meff and the meft option to get the misspecification effects. Misspecification effects are a ratio of the variance estimate from the current analysis to a hypothetical variance estimated from a misspecified model. Please see the Stata documentation for more details on how these are calculated. The coefficient of variation is the ratio of the standard error to the mean, multiplied by 100% (see page 33 of the Stata 13 svy manual). It is an indication of the variability relative to the mean in the population and is not affected by the unit of measurement of the variable.

svy: mean ridageyr
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       9756
Number of PSUs   =      31       Population size  =  306590681
                                 Design df        =         17

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    ridageyr |   37.18519   .6964767      35.71576    38.65463
--------------------------------------------------------------

estat effects

----------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.       DEFF      DEFT
-------------+--------------------------------------------
    ridageyr |   37.18519   .6964767     9.45724   3.07526
----------------------------------------------------------

estat effects, deff deft meff meft

------------------------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.       DEFF      DEFT      MEFF      MEFT
-------------+----------------------------------------------------------------
    ridageyr |   37.18519   .6964767     9.45724   3.07526   7.83352   2.79884
------------------------------------------------------------------------------

estat size

----------------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.              Obs            Size
-------------+--------------------------------------------------------
    ridageyr |   37.18519   .6964767              9756   306590680.995
----------------------------------------------------------------------

estat cv

------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     CV (%)
-------------+----------------------------------
    ridageyr |   37.18519   .6964767     1.87299
------------------------------------------------

display (.6964767/37.18519)*100
1.8729949
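
The relationships among these statistics can also be checked by hand. The lines below are a minimal sketch, assuming that svy: mean ridageyr is still the most recent estimation command, so that _b[ridageyr] and _se[ridageyr] hold the mean and its standard error. The first display recomputes the coefficient of variation without retyping the numbers; the second takes the square root of the Deff reported above, which in this example agrees with the reported Deft (about 3.075).

```stata
* Assumes -svy: mean ridageyr- is the active estimation result
display (_se[ridageyr] / _b[ridageyr]) * 100

* Square root of the Deff shown in the estat effects output above
display sqrt(9.45724)
```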

Analysis of subpopulations

Before we continue with our descriptive statistics, we should pause to discuss the analysis of subpopulations. The analysis of subpopulations is one place where survey data and experimental data are quite different. If you have data from an experiment (or quasi-experiment), and you want to analyze the responses from, say, just the women, or just people over age 50, you can just delete the unwanted cases from the data set or use the by: prefix. Survey data are different. With survey data, you (almost) never get to delete any cases from the data set, even if you will never use them in any of your analyses. Because of the way the by: prefix works, you usually don’t use it with survey data either. Instead, Stata has provided two options that allow you to correctly analyze subpopulations of your survey data. These options are subpop and over. The subpop option is sort of like deleting unwanted cases (without really deleting them, of course), and the over option is very similar to by: processing. We will start with some examples of the subpop option.

First, however, let’s take a second to see why deleting cases from a survey data set can be so problematic. If the data set is subset (meaning that observations not to be included in the subpopulation are deleted from the data set), the standard errors of the estimates cannot, in general, be calculated correctly. When a subpopulation option is used, only the cases defined by the subpopulation are used in the calculation of the estimate, but all cases are used in the calculation of the standard errors. For more information on this issue, please see Sampling Techniques, Third Edition by William G. Cochran (1977) and Small Area Estimation by J. N. K. Rao (2003). Also, if you look in the Stata 13 svy manual, you will find an entire section (pages 58-63) dedicated to the analysis of subpopulations. The formulas for using both if and subpop are given, along with an explanation of how they are different. If you look at the help for any svy: command, you will see the same warning:

Warning:  Use of if or in restrictions will not produce correct variance
      estimates for subpopulations in many cases.  To compute estimates for
      subpopulations, use the subpop() option.  The full specification for subpop()
      is
            subpop([varname] [if])
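
The contrast that the warning describes can be sketched with the female variable from this data set. The two commands below should give the same point estimate, but only the second keeps all observations in the variance calculation, so the standard errors may differ. This is an illustration of the syntax, not output from the workshop.

```stata
* Restricting with if: observations outside the subpopulation are dropped
* before the variance is estimated (not recommended for survey data)
svy: mean ridageyr if female == 1

* Using subpop(): all observations stay in the variance calculation
svy, subpop(female): mean ridageyr
```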

The subpop option on the svy: prefix is used with binary variables. (Technically, all cases coded as not 0 and not missing are part of the subpopulation; this means that if your subpopulation variable has values of 1 and 2, all of the observations will be included in the subpopulation unless you use a different syntax.) You may need to create a binary variable (using the generate command) in which all of the observations to be included in the subpopulation are coded as 1 and all other observations are coded as 0. For our example, we will use the variable female for our subpopulation variable, so that only females will be included in the calculation of the point estimate.

svy: mean ridageyr
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       9756
Number of PSUs   =      31       Population size  =  306590681
                                 Design df        =         17

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    ridageyr |   37.18519   .6964767      35.71576    38.65463
--------------------------------------------------------------

svy, subpop(female): mean ridageyr
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       9756
Number of PSUs   =      31       Population size  =  306590681
                                 Subpop. no. obs  =       4900
                                 Subpop. size     =  156959842
                                 Design df        =         17

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    ridageyr |   38.09657   .6713502      36.68014    39.51299
--------------------------------------------------------------
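
If the data set did not already contain a 0/1 indicator such as female, one could be created with the generate command before calling subpop. The sketch below assumes a hypothetical variable named gender that is coded 1 for males and 2 for females; the variable names gender and female_ind and the coding are assumptions made for illustration.

```stata
* gender is a hypothetical variable coded 1 = male, 2 = female
* The new indicator is 1 for females, 0 for males, and missing when gender is missing
generate byte female_ind = (gender == 2) if !missing(gender)

* Check the recoding before using the new variable
tabulate gender female_ind, missing

svy, subpop(female_ind): mean ridageyr
```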

To get the mean for the males, we can specify the subpopulation as the variable female not equal to 1 (meaning the observations that are coded 0).

svy, subpop(if female != 1): mean ridageyr
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       9756
Number of PSUs   =      31       Population size  =  306590681
                                 Subpop. no. obs  =       4856
                                 Subpop. size     =  149630839
                                 Design df        =         17

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    ridageyr |   36.22918   .8431945       34.4502    38.00817
--------------------------------------------------------------
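
One point to check with this syntax is missing data: in Stata, the expression female != 1 is also true for observations where female is missing. In this data set the two subpopulation counts (4900 and 4856) add up to the 9756 observations, so no values appear to be missing, but the sketch below shows how the check could be made and how the male subpopulation could be stated explicitly as female equal to 0.

```stata
* Count observations with a missing value for female
count if missing(female)

* Define the male subpopulation explicitly as female == 0
svy, subpop(if female == 0): mean ridageyr
```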

We can also use the over option, which will give the results for each level of the categorical variable listed. The over option is available only for svy: mean, svy: proportion, svy: ratio and svy: total.

svy, over(female): mean ridageyr
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       9756
Number of PSUs   =      31       Population size  =  306590681
                                 Design df        =         17

         male: female = male
       female: female = female

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
ridageyr     |
        male |   36.22918   .8431945       34.4502    38.00817
      female |   38.09657   .6713502      36.68014    39.51299
--------------------------------------------------------------

We can include more than one variable with the over option; this will give us results for every combination of the variables listed.

svy, over(dmdmartl female): mean pad630
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14        Number of obs    =      1705
Number of PSUs   =      31        Population size  =  77846933
                                  Design df        =        17

         Over: dmdmartl female
    _subpop_1: married male
    _subpop_2: married female
    _subpop_3: widowed male
    _subpop_4: widowed female
    _subpop_5: divorced male
    _subpop_6: divorced female
    _subpop_7: separated male
    _subpop_8: separated female
    _subpop_9: never married male
   _subpop_10: never married female
   _subpop_11: living with partner male
   _subpop_12: living with partner female

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
pad630       |
   _subpop_1 |   156.6941   8.765612      138.2002    175.1879
   _subpop_2 |   118.6255   9.590405      98.39152    138.8595
   _subpop_3 |   132.6111   20.77576      88.77813    176.4442
   _subpop_4 |   120.9569   11.10439      97.52869    144.3851
   _subpop_5 |    209.423   27.28619      151.8541    266.9918
   _subpop_6 |   124.6343   15.50591      91.91973    157.3489
   _subpop_7 |   187.9685   38.86106      105.9789    269.9582
   _subpop_8 |   200.9333   52.49986      90.16828    311.6983
   _subpop_9 |   146.4231   12.35329      120.3599    172.4862
  _subpop_10 |   131.7091   13.30558      103.6368    159.7814
  _subpop_11 |    201.422   13.41895      173.1105    229.7335
  _subpop_12 |   122.5736   14.88511      91.16879    153.9785
--------------------------------------------------------------

We can also use the subpop option and the over option together. In the example below, we get the mean of pad630 for females in each combination of dmdmartl and dmdeduc2 (30 levels).

svy, subpop(female): mean pad630, over(dmdmartl dmdeduc2) 
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       5619
Number of PSUs   =      31       Population size  =  185989071
                                 Subpop. no. obs  =        763
                                 Subpop. size     = 36358232.8
                                 Design df        =         17

         Over: dmdmartl dmdeduc2
    _subpop_1: married less than 9th grade
    _subpop_2: married no hs diploma
    _subpop_3: married hs grad or GED
    _subpop_4: married some college or AA degre
    _subpop_5: married college grad or above
    _subpop_6: widowed less than 9th grade
    _subpop_7: widowed no hs diploma
    _subpop_8: widowed hs grad or GED
    _subpop_9: widowed some college or AA degre
   _subpop_10: widowed college grad or above
   _subpop_11: divorced less than 9th grade
   _subpop_12: divorced no hs diploma
   _subpop_13: divorced hs grad or GED
   _subpop_14: divorced some college or AA degr
   _subpop_15: divorced college grad or above
   _subpop_16: separated less than 9th grade
   _subpop_17: separated no hs diploma
   _subpop_18: separated hs grad or GED
   _subpop_19: separated some college or AA deg
   _subpop_20: separated college grad or above
   _subpop_21: never married less than 9th grad
   _subpop_22: never married no hs diploma
   _subpop_23: never married hs grad or GED
   _subpop_24: never married some college or AA
   _subpop_25: never married college grad or ab
   _subpop_26: living with partner less than 9t
   _subpop_27: living with partner no hs diplom
   _subpop_28: living with partner hs grad or G
   _subpop_29: living with partner some college
   _subpop_30: living with partner college grad

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
pad630       |
   _subpop_1 |   123.1873   33.99413      51.46593    194.9086
   _subpop_2 |   143.3616   14.00536      113.8129    172.9103
   _subpop_3 |    111.808   12.92772      84.53291    139.0831
   _subpop_4 |   146.2888   22.89246      97.98991    194.5877
   _subpop_5 |   88.91933    8.30694      71.39322    106.4454
   _subpop_6 |   176.0959   72.50964      23.11395    329.0779
   _subpop_7 |   90.60535   17.21293      54.28923    126.9215
   _subpop_8 |   110.4805   37.21189      31.97029    188.9907
   _subpop_9 |   101.8385   11.74256      77.06389    126.6132
  _subpop_10 |   260.9962   63.18704      127.6832    394.3092
  _subpop_11 |   54.28887   15.49709      21.59287    86.98487
  _subpop_12 |   163.2022   47.28731      63.43472    262.9697
  _subpop_13 |   145.5074   25.43471      91.84482    199.1699
  _subpop_14 |   129.5883   28.49053      69.47856    189.6981
  _subpop_15 |   76.00663   11.32266       52.1179    99.89535
  _subpop_16 |   169.3599   80.57609     -.6408196    339.3606
  _subpop_17 |   165.8482   46.57393      67.58582    264.1106
  _subpop_18 |   302.0225   117.2982      54.54487    549.5001
  _subpop_19 |   142.5322   26.01105      87.65368    197.4107
  _subpop_20 |   117.9216   34.79595      44.50859    191.3347
  _subpop_21 |   309.9912    93.1523      113.4571    506.5254
  _subpop_22 |   166.3335   46.45898      68.31357    264.3533
  _subpop_23 |   190.2853   55.64751      72.87928    307.6913
  _subpop_24 |   109.5492   15.62974      76.57336    142.5251
  _subpop_25 |   126.5192   19.29269      85.81517    167.2232
  _subpop_26 |        120          .             .           .
  _subpop_27 |    144.995   24.39718       93.5215    196.4686
  _subpop_28 |   98.81543   21.16077      54.17012    143.4607
  _subpop_29 |   152.5923    23.7468      102.4909    202.6936
  _subpop_30 |    89.0113   23.17592      40.11439    137.9082
--------------------------------------------------------------

For subpopulation 26, no standard error or confidence interval is given. This may be because there are very few observations at that level. To see if this is the cause, we can use the estat size command. The unweighted number of cases is given in the column titled "Obs", and the weighted number of cases (the estimated subpopulation size) is given in the column titled "Size".

estat size

         Over: dmdmartl dmdeduc2
    _subpop_1: married less than 9th grade
    _subpop_2: married no hs diploma
    _subpop_3: married hs grad or ged
    _subpop_4: married some college or AA degre
    _subpop_5: married college grad or above
    _subpop_6: widowed less than 9th grade
    _subpop_7: widowed no hs diploma
    _subpop_8: widowed hs grad or ged
    _subpop_9: widowed some college or AA degre
   _subpop_10: widowed college grad or above
   _subpop_11: divorced less than 9th grade
   _subpop_12: divorced no hs diplom
   _subpop_13: divorced hs grad or ged
   _subpop_14: divorced some college or AA degr
   _subpop_15: divorced college grad or above
   _subpop_16: separated less than 9th grade
   _subpop_17: separated no hs diploma
   _subpop_18: separated hs grad or ged
   _subpop_19: separated some college or AA deg
   _subpop_20: separated college grad or above
   _subpop_21: never married less than 9th grad
   _subpop_22: never married no hs diplom
   _subpop_23: never married hs grad or ged
   _subpop_24: never married some college or AA
   _subpop_25: never married college grad or ab
   _subpop_26: living with partner less than 9t
   _subpop_27: living with partner no hs diplom
   _subpop_28: living with partner hs grad or g
   _subpop_29: living with partner some college
   _subpop_30: living with partner college grad

----------------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.              Obs            Size
-------------+--------------------------------------------------------
pad630       |
   _subpop_1 |   123.1873   33.99413                14   282303.798783
   _subpop_2 |   143.3616   14.00536                36    1493852.2679
   _subpop_3 |    111.808   12.92772                60   2807704.21163
   _subpop_4 |   146.2888   22.89246               121   7032757.93552
   _subpop_5 |   88.91933    8.30694               116   7192028.73485
   _subpop_6 |   176.0959   72.50964                 5   111127.083356
   _subpop_7 |   90.60535   17.21293                17   562709.323876
   _subpop_8 |   110.4805   37.21189                15    596030.21488
   _subpop_9 |   101.8385   11.74256                27    1272472.7973
  _subpop_10 |   260.9962   63.18704                 9   296513.579822
  _subpop_11 |   54.28887   15.49709                 3    78128.930803
  _subpop_12 |   163.2022   47.28731                10   414794.852531
  _subpop_13 |   145.5074   25.43471                23   1215808.11627
  _subpop_14 |   129.5883   28.49053                39   1738510.76294
  _subpop_15 |   76.00663   11.32266                14   914950.601057
  _subpop_16 |   169.3599   80.57609                 4    88739.706649
  _subpop_17 |   165.8482   46.57393                10   262282.787221
  _subpop_18 |   302.0225   117.2982                10   327045.836107
  _subpop_19 |   142.5322   26.01105                10   247858.703249
  _subpop_20 |   117.9216   34.79595                 4    79284.608695
  _subpop_21 |   309.9912    93.1523                 4   104748.202272
  _subpop_22 |   166.3335   46.45898                10   343955.585817
  _subpop_23 |   190.2853   55.64751                19    748235.70547
  _subpop_24 |   109.5492   15.62974                79   2939581.96271
  _subpop_25 |   126.5192   19.29269                37   1786563.47803
  _subpop_26 |        120          0                 2    83816.117097
  _subpop_27 |    144.995   24.39718                 9    312110.74153
  _subpop_28 |   98.81543   21.16077                17   872758.146424
  _subpop_29 |   152.5923    23.7468                26   1355189.31349
  _subpop_30 |    89.0113   23.17592                13   796368.717456
----------------------------------------------------------------------

list pad630 if female == 1 & dmdmartl == 6 & dmdeduc2 == 1

      +--------+
      | pad630 |
      |--------|
 344. |    120 |
 479. |      . |
1339. |      . |
1962. |      . |
1987. |    120 |
      |--------|
2075. |      . |
2148. |      . |
2178. |      . |
2631. |      . |
2972. |      . |
      |--------|
3148. |      . |
4118. |      . |
4595. |      . |
6610. |      . |
7064. |      . |
      |--------|
7112. |      . |
7214. |      . |
7709. |      . |
7829. |      . |
8095. |      . |
      |--------|
8479. |      . |
      +--------+

By using the list command, we can see that there are only two cases that have a valid value for pad630 in subpopulation 26, and both of those values are 120. This is why no standard error can be estimated.
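
The same check can be made without listing every case. A minimal sketch, using the same conditions as the list command above: count reports how many observations in this cell have a nonmissing value of pad630, and tabulate shows the values they take. Based on the listing above, the count should be 2.

```stata
* Number of cases in subpopulation 26 with a valid value of pad630
count if female == 1 & dmdmartl == 6 & dmdeduc2 == 1 & !missing(pad630)

* Distribution of the valid values in that cell
tabulate pad630 if female == 1 & dmdmartl == 6 & dmdeduc2 == 1
```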

The lincom command can be used to make comparisons between subpopulations. In the first example below, we will compare the means for males and females for hsq496 (number of days feeling anxious). We can use the display command to see how the point estimate in the output of the lincom command is calculated. The value of using the lincom command is that the standard error of the point estimate is also calculated, as well as the test statistic, p-value and 95% confidence interval. In the examples that follow, we will compare the mean number of days feeling anxious for those who are married with the mean for those who are living with a partner, and the mean for those who are married with the mean for those who are widowed. Please see page 109 of the Stata 13 svy manual (first example in the section on survey postestimation) for more information on using lincom after the svy, subpop(): mean command.

svy, over(female): mean hsq496
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       5883
Number of PSUs   =      31       Population size  =  225667462
                                 Design df        =         17

         male: female = male
       female: female = female

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
hsq496       |
        male |   4.589723   .1956524      4.176933    5.002514
      female |   6.153479   .2675166      5.589069     6.71789
--------------------------------------------------------------

lincom [hsq496]male - [hsq496]female

 ( 1)  [hsq496]male - [hsq496]female = 0

------------------------------------------------------------------------------
        Mean |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -1.563756   .2702582    -5.79   0.000    -2.133951   -.9935614
------------------------------------------------------------------------------

display 4.589723 - 6.153479
-1.563756
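
The standard error reported by lincom can be reproduced in a similar way. For a difference of two estimates, the variance is the sum of the two variances minus twice their covariance, and all three pieces are stored in the variance-covariance matrix e(V). The lines below are a minimal sketch, assuming that svy, over(female): mean hsq496 is still the most recent estimation command, so that rows and columns 1 and 2 of e(V) correspond to males and females; the matrix name V is arbitrary.

```stata
* Copy the variance-covariance matrix of the two subpopulation means
matrix V = e(V)
matrix list V

* Standard error of the difference: sqrt(Var(male) + Var(female) - 2*Cov)
display sqrt(V[1,1] + V[2,2] - 2*V[1,2])
```

The result should match the standard error in the lincom output. Ignoring the covariance term and using only the two standard errors from the table would give about .331 rather than .270, so the covariance between the two subpopulation means matters here.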

svy, over(dmdmartl): mean hsq496
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       4692
Number of PSUs   =      31       Population size  =  193938169
                                 Design df        =         17

      married: dmdmartl = married
      widowed: dmdmartl = widowed
     divorced: dmdmartl = divorced
    separated: dmdmartl = separated
    _subpop_5: dmdmartl = never married
    _subpop_6: dmdmartl = living with partner

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
hsq496       |
     married |   4.915051   .2864456      4.310704    5.519399
     widowed |   5.450213   .3868792      4.633969    6.266457
    divorced |   7.514127    .473958      6.514163    8.514091
   separated |   7.231938   1.136924      4.833238    9.630637
   _subpop_5 |   6.223865    .408766      5.361444    7.086286
   _subpop_6 |   7.266146   .6322425       5.93223    8.600061
--------------------------------------------------------------

lincom [hsq496]married - [hsq496]_subpop_6

 ( 1)  [hsq496]married - [hsq496]_subpop_6 = 0

------------------------------------------------------------------------------
        Mean |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -2.351094   .6468052    -3.63   0.002    -3.715734   -.9864544
------------------------------------------------------------------------------

lincom [hsq496]married - [hsq496]widowed

 ( 1)  [hsq496]married - [hsq496]widowed = 0

------------------------------------------------------------------------------
        Mean |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.5351614   .5175658    -1.03   0.316     -1.62713    .5568069
------------------------------------------------------------------------------

Graphing with continuous variables

We can also get some descriptive graphs of our variables. For a continuous variable, you may want a histogram. However, the histogram command will only accept a frequency weight, which, by definition, can have only integer values. A suggestion by Heeringa, West and Berglund (2010, pages 121-122) is to simply use the integer part of the sampling weight. We can create a frequency weight from our sampling weight using the generate command with the int function.

gen int_wtint2yr = int(wtint2yr)
histogram pad630 [fw = int_wtint2yr], bin(20)

histogram ridageyr [fw = int_wtint2yr], bin(20) normal
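
Because int() drops the fractional part of each weight, it can be worth checking how much weight is lost. The lines below are a minimal sketch; the variable name wt_lost is an assumption made for illustration. Each difference lies between 0 and 1, which is small relative to an average weight of roughly 31,000 in this data set (306590681/9756).

```stata
* Fractional part of each weight that int() discards
generate double wt_lost = wtint2yr - int_wtint2yr

* Compare the original weight, the truncated weight, and the amount lost
summarize wtint2yr int_wtint2yr wt_lost
```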

We can make box plots and use the sampling weight that is provided in the data set.

graph box hsq496 [pw = wtint2yr]

graph box hsq496 [pw = wtint2yr], by(female) ylabel(0(5)30)

 
* the line in the box plot represents the median, not the mean
svy, over(female): mean hsq496
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       5883
Number of PSUs   =      31       Population size  =  225667462
                                 Design df        =         17

         male: female = male
       female: female = female

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
hsq496       |
        male |   4.589723   .1956524      4.176933    5.002514
      female |   6.153479   .2675166      5.589069     6.71789
--------------------------------------------------------------

estat sd

         male: female = male
       female: female = female

-------------------------------------
        Over |       Mean   Std. Dev.
-------------+-----------------------
hsq496       |
        male |   4.589723    8.408553
      female |   6.153479    9.096215
-------------------------------------

We can use the svy: total command to get totals. The values in the output can be so large that Stata displays them in scientific notation, so we can use the estimates table command with a wider display format to see them in full. We can also use the matlist command.

svy: total pad630
(running total on estimation sample)

Survey: Total estimation

Number of strata =      14        Number of obs    =      2054
Number of PSUs   =      31        Population size  =  88768571
                                  Design df        =        17

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      pad630 |   1.24e+10   1.07e+09      1.02e+10    1.47e+10
--------------------------------------------------------------

estimates table, b(%15.2f) se(%13.2f)

--------------------------------
    Variable |     active       
-------------+------------------
      pad630 |  12417602483.56  
             |   1065724292.32  
--------------------------------
                    legend: b/se

svy: total pad630
(running total on estimation sample)

Survey: Total estimation

Number of strata =      14        Number of obs    =      2054
Number of PSUs   =      31        Population size  =  88768571
                                  Design df        =        17

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      pad630 |   1.24e+10   1.07e+09      1.02e+10    1.47e+10
--------------------------------------------------------------

matlist e(b), format(%15.2f)

             |          pad630 
-------------+-----------------
          y1 |  12417602483.56 
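
The estimate and its standard error can also be displayed directly. A minimal sketch, assuming that svy: total pad630 is still the most recent estimation command, so that _b[pad630] and _se[pad630] hold the total and its standard error:

```stata
* Show the estimated total and its standard error without scientific notation
display %15.2f _b[pad630]
display %15.2f _se[pad630]
```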

svy, over(female): total pad630
(running total on estimation sample)

Survey: Total estimation

Number of strata =      14        Number of obs    =      2054
Number of PSUs   =      31        Population size  =  88768571
                                  Design df        =        17

         male: female = male
       female: female = female

--------------------------------------------------------------
             |             Linearized
        Over |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
pad630       |
        male |   7.41e+09   6.37e+08      6.06e+09    8.75e+09
      female |   5.01e+09   4.72e+08      4.01e+09    6.00e+09
--------------------------------------------------------------

estimates table, b(%15.2f) se(%13.2f)

--------------------------------
    Variable |     active       
-------------+------------------
        male |   7408679626.25  
             |    637463721.35  
      female |   5008922857.30  
             |    472105366.03  
--------------------------------
                    legend: b/se

Relationships between two continuous variables

Let’s look at some bivariate relationships. As of Stata 13.1, we cannot get correlations with survey data (but you can with SUDAAN 11). The graphical version of a correlation is a scatterplot, which we show below. We will include the fit line.

svy: mean pad630 pad675
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14        Number of obs    =      1030
Number of PSUs   =      31        Population size  =  48053490
                                  Design df        =        17

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      pad630 |   129.7198   5.723011      117.6453    141.7943
      pad675 |   72.48499   2.365431      67.49436    77.47561
--------------------------------------------------------------

twoway (scatter pad630 pad675) (lfit pad630 pad675 [pw = wtint2yr]), ///
title("minutes of moderate intensity work" ///
"v. minutes of moderate recreational activities")

Descriptive statistics with categorical variables

Let’s get some descriptive statistics with categorical variables. We can use the svy: tab and svy: proportion commands. We will use the marital status variable, dmdmartl, which has six levels. When used with no options, the output from the svy: tab command will give us the same information as the svy: proportion command.

svy: tab dmdmartl
(running tabulate on estimation sample)

Number of strata   =        14                 Number of obs      =       5553
Number of PSUs     =        31                 Population size    =  223858353
                                               Design df          =         17

-----------------------
Marital   |
Status    | proportions
----------+------------
  married |       .5308
  widowed |       .0562
 divorced |       .1069
 separate |        .024
 never ma |       .1987
 living w |       .0834
          | 
    Total |           1
-----------------------
  Key:  proportions  =  cell proportions

svy: proportion dmdmartl
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata =      14       Number of obs    =       5553
Number of PSUs   =      31       Population size  =  223858353
                                 Design df        =         17

      _prop_5: dmdmartl = never married
      _prop_6: dmdmartl = living with partner

--------------------------------------------------------------
             |             Linearized
             | Proportion   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
dmdmartl     |
     married |   .5307919   .0204953      .4874276    .5736962
     widowed |   .0562251   .0031996      .0498433    .0633695
    divorced |   .1068817    .007248      .0925245    .1231643
   separated |   .0239747   .0031613      .0181365    .0316316
     _prop_5 |   .1986955   .0236294      .1534768    .2532512
     _prop_6 |   .0834312   .0064732       .070751    .0981439
--------------------------------------------------------------

Let’s use some options with the svy: tab command so that we can see the estimated number of people in each category.

svy: tab dmdmartl, count cellwidth(12) format(%12.2g)
(running tabulate on estimation sample)

Number of strata   =        14                 Number of obs      =       5553
Number of PSUs     =        31                 Population size    =  223858353
                                               Design df          =         17

------------------------
Marital   |
Status    |        count
----------+-------------
  married |    118822198
  widowed |     12586462
 divorced |     23926362
 separate |      5366932
 never ma |     44479637
 living w |     18676762
          | 
    Total |    223858353
------------------------
  Key:  count         =  weighted counts

There are many options that can be used with svy: tab. Please see the Stata help file for the svy: tabulate command for a complete listing and description of each option. Only five items can be displayed at once, and the ci option counts as two items.

svy: tab dmdmartl, cell count obs cellwidth(12) format(%12.2g) 
(running tabulate on estimation sample)

Number of strata   =        14                 Number of obs      =       5553
Number of PSUs     =        31                 Population size    =  223858353
                                               Design df          =         17

----------------------------------------------------
Marital   |
Status    |        count   proportions           obs
----------+-----------------------------------------
  married |    118822198           .53          2683
  widowed |     12586462          .056           467
 divorced |     23926362           .11           571
 separate |      5366932          .024           204
 never ma |     44479637            .2          1188
 living w |     18676762          .083           440
          | 
    Total |    223858353             1          5553
----------------------------------------------------
  Key:  count         =  weighted counts
        proportions   =  cell proportions
        obs           =  number of observations
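
As a quick check on how the columns in this table relate, each cell proportion is the weighted count divided by the estimated population size. The sketch below simply reuses numbers copied from the output above; the table shows them rounded to two significant digits because of the %12.2g format.

```stata
* proportion married = weighted count / estimated population size
display 118822198 / 223858353
* proportion widowed
display 12586462 / 223858353
```

The two results should round to the .53 and .056 shown in the proportions column.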

svy: tab dmdmartl, count se cellwidth(15) format(%15.2g)
(running tabulate on estimation sample)

Number of strata   =        14                 Number of obs      =       5553
Number of PSUs     =        31                 Population size    =  223858353
                                               Design df          =         17

--------------------------------------------
Marital   |
Status    |           count               se
----------+---------------------------------
  married |       118822198         10556102
  widowed |        12586462          1087437
 divorced |        23926362          2606988
 separate |         5366932           614868
 never ma |        44479637          4687152
 living w |        18676762          1874572
          | 
    Total |       223858353                 
--------------------------------------------
  Key:  count            =  weighted counts
        se               =  linearized standard errors of weighted counts

svy: tab dmdmartl, count deff deft cv cellwidth(12) format(%12.2g) 
(running tabulate on estimation sample)

Number of strata   =        14                 Number of obs      =       5553
Number of PSUs     =        31                 Population size    =  223858353
                                               Design df          =         17

------------------------------------------------------------------
Marital   |
Status    |        count            cv          deff          deft
----------+-------------------------------------------------------
  married |    118822198           8.9            50             7
  widowed |     12586462           8.6           2.5           1.6
 divorced |     23926362            11           7.9           2.8
 separate |      5366932            11           1.8           1.3
 never ma |     44479637            11            15           3.9
 living w |     18676762            10           5.1           2.3
          | 
    Total |    223858353                                          
------------------------------------------------------------------
  Key:  count         =  weighted counts
        cv            =  coefficients of variation of weighted counts
        deff          =  deff for variances of weighted counts
        deft          =  deft for variances of weighted counts
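
The quantities in the last two tables can be tied together with a little arithmetic. As a rough sketch, the coefficient of variation is the standard error expressed as a percentage of the estimate, and deft is approximately the square root of deff. The numbers are copied from the output above; because deff is displayed with only two significant digits, the second check is approximate.

```stata
* cv (in percent) for married = 100 * se / weighted count
display 100 * 10556102 / 118822198
* deft for widowed is roughly the square root of the displayed deff
display sqrt(2.5)
```

These should come out near the 8.9 and 1.6 reported in the table.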

Chi-square tests are provided by default when svy: tab is issued with two variables. You will usually want to use the design-based test, which corrects the Pearson statistic for the survey design; the uncorrected statistic does not.

The proportion of observations in each cell can be obtained using either the svy: tab or the svy: proportion command. If you want to compare the proportions within two cells, you can use the lincom command.

svy: tab dmdmartl female, cell obs count cellwidth(12) format(%12.2g)
(running tabulate on estimation sample)

Number of strata   =        14                 Number of obs      =       5553
Number of PSUs     =        31                 Population size    =  223858353
                                               Design df          =         17

----------------------------------------------------
Marital   |       RECODE of riagendr (Gender)       
Status    |         male        female         Total
----------+-----------------------------------------
  married |     58666065      60156133     118822198
          |          .26           .27           .53
          |         1429          1254          2683
          | 
  widowed |      2732447       9854015      12586462
          |         .012          .044          .056
          |          122           345           467
          | 
 divorced |      9287612      14638749      23926362
          |         .041          .065           .11
          |          237           334           571
          | 
 separate |      2019361       3347571       5366932
          |         .009          .015          .024
          |           79           125           204
          | 
 never ma |     25152255      19327382      44479637
          |          .11          .086            .2
          |          634           554          1188
          | 
 living w |      9529522       9147240      18676762
          |         .043          .041          .083
          |          237           203           440
          | 
    Total |    107387263     116471090     223858353
          |          .48           .52             1
          |         2738          2815          5553
----------------------------------------------------
  Key:  weighted counts
        cell proportions
        number of observations

  Pearson:
    Uncorrected   chi2(5)         =  148.4759
    Design-based  F(3.19, 54.18)  =   22.0446     P = 0.0000

svy: proportion dmdmartl, over(female) 
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata =      14       Number of obs    =       5553
Number of PSUs   =      31       Population size  =  223858353
                                 Design df        =         17

      married: dmdmartl = married
      widowed: dmdmartl = widowed
     divorced: dmdmartl = divorced
    separated: dmdmartl = separated
      _prop_5: dmdmartl = never married
      _prop_6: dmdmartl = living with partner

         male: female = male
       female: female = female

--------------------------------------------------------------
             |             Linearized
        Over | Proportion   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
married      |
        male |   .5463038   .0244819       .494338    .5972798
      female |   .5164898   .0186122      .4772006    .5555763
-------------+------------------------------------------------
widowed      |
        male |   .0254448   .0029445      .0199183    .0324539
      female |   .0846048    .005444      .0738042    .0968207
-------------+------------------------------------------------
divorced     |
        male |   .0864871   .0103046      .0670772    .1108462
      female |   .1256857   .0097098      .1065877    .1476401
-------------+------------------------------------------------
separated    |
        male |   .0188045   .0035212      .0126507    .0278672
      female |   .0287416   .0042363      .0210327    .0391631
-------------+------------------------------------------------
_prop_5      |
        male |   .2342201   .0271275      .1818695    .2961843
      female |   .1659415    .021306      .1257079    .2158725
-------------+------------------------------------------------
_prop_6      |
        male |   .0887398   .0077162      .0737516     .106424
      female |   .0785366   .0066277      .0656434    .0937081
--------------------------------------------------------------
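
Note that the proportions from svy: proportion with over(female) are computed within each gender, so they correspond to column proportions of the two-way table rather than to the cell proportions shown by the cell option. A minimal check, using weighted counts copied from the svy: tab output above:

```stata
* married males as a share of all males
display 58666065 / 107387263
* married females as a share of all females
display 60156133 / 116471090
```

The results should agree with the .5463038 and .5164898 reported for the married rows. Because display does not alter the stored estimation results, it is safe to run these lines before the lincom command.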

lincom [married]male - [married]female

 ( 1)  [married]male - [married]female = 0

------------------------------------------------------------------------------
  Proportion |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .0298139   .0132293     2.25   0.038     .0019026    .0577253
------------------------------------------------------------------------------

display .5463038 - .5164898
.029814
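
The remaining columns of the lincom output can be reproduced the same way. The sketch below assumes the test uses a t distribution on the design degrees of freedom (number of PSUs minus number of strata, 31 - 14 = 17), which is what the output header reports.

```stata
* t statistic: estimated difference divided by its linearized standard error
display .0298139 / .0132293
* two-sided p-value on 17 design degrees of freedom
display 2 * ttail(17, .0298139 / .0132293)
* half-width of the 95% confidence interval
display invttail(17, .025) * .0132293
```

The first two results should be close to the t of 2.25 and the p-value of 0.038 shown above; adding and subtracting the third from .0298139 should give approximately the reported confidence limits.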

Graphs with categorical variables

Let’s create a bar graph of the variable female. Because the sampling weight is supplied, this graph will show the weighted percent of respondents who are female and male. To do this, we will need to create a new variable, which we will call male; it will be the opposite of female.

gen male = !female
graph bar (mean) female male [pw = wtint2yr], percentages bargap(7)
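
One caution, offered as a sketch rather than a correction: in Stata, the expression !female evaluates to 0 when female is missing, so a case with a missing value on female would be coded as 0 on male. The gender variable in this data set may well have no missing values, in which case nothing changes. The variable name male2 below is hypothetical and is used only so that the variable created above is not overwritten.

```stata
* build the complement so that a missing value on female stays missing
gen male2 = 1 - female if !missing(female)
* confirm the two indicators sum to 1 wherever female is observed
assert male2 + female == 1 if !missing(female)
```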

We can also graph the mean of the variable hsq496 by each level of dmdmartl.

svy: mean hsq496, over(dmdmartl)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14       Number of obs    =       4692
Number of PSUs   =      31       Population size  =  193938169
                                 Design df        =         17

      married: dmdmartl = married
      widowed: dmdmartl = widowed
     divorced: dmdmartl = divorced
    separated: dmdmartl = separated
    _subpop_5: dmdmartl = never married
    _subpop_6: dmdmartl = living with partner

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
hsq496       |
     married |   4.915051   .2864456      4.310704    5.519399
     widowed |   5.450213   .3868792      4.633969    6.266457
    divorced |   7.514127    .473958      6.514163    8.514091
   separated |   7.231938   1.136924      4.833238    9.630637
   _subpop_5 |   6.223865    .408766      5.361444    7.086286
   _subpop_6 |   7.266146   .6322425       5.93223    8.600061
--------------------------------------------------------------

graph hbar hsq496 [pw = wtint2yr], over(dmdmartl, gap(*2)) ///
title("During the last 30 days, for about how many days" ///
"have you felt worried, tense or anxious?")

Finally, we will graph the mean of pad630 for each level of dmdeduc2.

svy: mean pad630, over(dmdeduc2)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      14        Number of obs    =      1706
Number of PSUs   =      31        Population size  =  77856183
                                  Design df        =        17

    _subpop_1: dmdeduc2 = less than 9th grade
    _subpop_2: dmdeduc2 = no hs diploma
    _subpop_3: dmdeduc2 = hs grad or GED
    _subpop_4: dmdeduc2 = some college or AA degree
    _subpop_5: dmdeduc2 = college grad or above

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
pad630       |
   _subpop_1 |   176.8361   19.65517      135.3673    218.3048
   _subpop_2 |   186.4248    6.52083      172.6671    200.1826
   _subpop_3 |   157.9824   7.430086      142.3063    173.6585
   _subpop_4 |   147.1004   10.20694      125.5657    168.6352
   _subpop_5 |   111.1123   7.475883       95.3396    126.8851
--------------------------------------------------------------

graph bar pad630 [pw = wtint2yr], over(dmdeduc2, label(angle(45))) ///
title("How much time do you spend doing" ///
"moderate-intensity activities at work on a typical day?")

OLS regression

Now that we have descriptive statistics on our variables, we may want to run some inferential statistics, such as OLS regression or logistic regression. As before, we simply need to use the svy: prefix before our regression commands.

Please note that the following analyses are shown only as examples of how to do these analyses in Stata. There was no attempt to create substantively meaningful models. Rather, the variables were chosen for illustrative purposes only. We do not recommend that researchers create their models this way.

We will start with an OLS regression with one categorical predictor (female) and one continuous predictor (ridageyr). We will follow the svy: regress command with the margins command, which gives us the predicted values for each level of female.

Notice that we use the vce(unconditional) option with each call to margins. The following information is quoted from the Stata 13 svy manual, page 113: “When performing estimations with linearized standard errors, we use the vce(unconditional) option to compute marginal effects so that we can use the results to make inferences on the population. margins with vce(unconditional) uses linearization to compute the unconditional variance of the marginal means.”

svy: regress pad630 i.female ridageyr
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        14                  Number of obs      =      2054
Number of PSUs     =        31                  Population size    =  88768571
                                                Design df          =        17
                                                F(   2,     16)    =     17.18
                                                Prob > F           =    0.0001
                                                R-squared          =    0.0193

------------------------------------------------------------------------------
             |             Linearized
      pad630 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |
     female  |  -33.12513   6.018746    -5.50   0.000    -45.82358   -20.42669
    ridageyr |   -.287604   .1289996    -2.23   0.040    -.5597695   -.0154386
       _cons |   167.2895   9.314455    17.96   0.000     147.6378    186.9413
------------------------------------------------------------------------------

margins female, vce(unconditional)

Predictive margins                              Number of obs      =      2054

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |             Linearized
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |
       male  |    155.248   6.994755    22.19   0.000     140.4903    170.0056
     female  |   122.1228   5.366491    22.76   0.000     110.8005    133.4452
------------------------------------------------------------------------------
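
Because this model has no interaction terms, the two predictive margins differ by exactly the coefficient for female. A quick check with values copied from the output above:

```stata
* margin for females minus margin for males
display 122.1228 - 155.248
```

The result should match the female coefficient of -33.12513 up to rounding.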

Now let’s run a model with a categorical by categorical interaction term. We can use the contrast command to get the test of the interaction displayed as an F-test rather than a t-statistic. Of course, the p-value is exactly the same. We can use the margins command to get the predicted values of pad630 for each combination of female and hsq571. The marginsplot command is used to obtain a graph of the interaction (marginsplot graphs the values shown in the output from the margins command).

svy: regress pad630 i.female##i.hsq571 ridageyr
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        14                  Number of obs      =      1673
Number of PSUs     =        31                  Population size    =  76183526
                                                Design df          =        17
                                                F(   4,     14)    =     22.40
                                                Prob > F           =    0.0000
                                                R-squared          =    0.0457

-------------------------------------------------------------------------------
              |             Linearized
       pad630 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
       female |
      female  |  -40.42599   6.586285    -6.14   0.000    -54.32184   -26.53014
              |
       hsq571 |
         yes  |  -42.38641   19.18468    -2.21   0.041    -82.86254   -1.910279
              |
female#hsq571 |
  female#yes  |   59.28491   19.63991     3.02   0.008     17.84833    100.7215
              |
     ridageyr |  -.9869588   .2047143    -4.82   0.000    -1.418868   -.5550493
        _cons |   209.1557   11.97401    17.47   0.000     183.8927    234.4186
-------------------------------------------------------------------------------

contrast female#hsq571

Contrasts of marginal linear predictions

                                                Design df          =        17

Margins      : asbalanced

-------------------------------------------------
              |         df           F        P>F
--------------+----------------------------------
female#hsq571 |          1        9.11     0.0077
       Design |         17
-------------------------------------------------
Note: F statistics are adjusted for the survey
      design.
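
With a single degree of freedom in the numerator, the F statistic from contrast is the square of the t statistic for the interaction term in the regression table, which is why the p-values agree. A minimal check using the coefficient and standard error copied from the regression output:

```stata
* squared t statistic for the female#yes interaction term
display (59.28491 / 19.63991)^2
```

The result should be close to the F of 9.11 shown above.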

margins female#hsq571, vce(unconditional)

Predictive margins                              Number of obs      =      1673

Expression   : Linear prediction, predict()

-------------------------------------------------------------------------------
              |             Linearized
              |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
female#hsq571 |
     male#no  |   165.6268    6.65576    24.88   0.000     151.5844    179.6692
    male#yes  |   123.2404   22.22847     5.54   0.000     76.34241    170.1383
   female#no  |   125.2008   5.586263    22.41   0.000     113.4148    136.9868
  female#yes  |   142.0993   28.20328     5.04   0.000     82.59557     201.603
-------------------------------------------------------------------------------
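
The interaction coefficient can be recovered from these four margins as a difference of differences: the effect of hsq571 among females minus the effect of hsq571 among males. The values below are copied from the margins output above.

```stata
* effect of hsq571 (yes vs no) among males
display 123.2404 - 165.6268
* effect of hsq571 (yes vs no) among females
display 142.0993 - 125.2008
* difference of the two effects
display (142.0993 - 125.2008) - (123.2404 - 165.6268)
```

The first result should match the hsq571 coefficient (about -42.39) and the last should match the female#yes coefficient (about 59.28), up to rounding.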

marginsplot

  Variables that uniquely identify margins: female hsq571

In this example, we will use a categorical by continuous interaction. In the first call to the margins command, we get the simple slope coefficients for males and females. The p-values in this table tell us that both slopes are significantly different from 0. In the second call to the margins command, we get the predicted values for each level of female at pad680 = 0, pad680 = 200, pad680 = 400, up to pad680 = 1400. The t-statistics and corresponding p-values tell us if the predicted value is different from 0.

svy: regress pad630 i.female##c.pad680 
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        14                  Number of obs      =      2050
Number of PSUs     =        31                  Population size    =  88700876
                                                Design df          =        17
                                                F(   3,     15)    =    131.11
                                                Prob > F           =    0.0000
                                                R-squared          =    0.0894

---------------------------------------------------------------------------------
                |             Linearized
         pad630 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
         female |
        female  |  -73.89628   18.78819    -3.93   0.001    -113.5359   -34.25665
         pad680 |  -.2389814   .0194172   -12.31   0.000    -.2799481   -.1980148
                |
female#c.pad680 |
        female  |   .1273678   .0445652     2.86   0.011     .0333435    .2213922
                |
          _cons |   234.2159   9.710639    24.12   0.000     213.7282    254.7035
---------------------------------------------------------------------------------

margins female, dydx(pad680) vce(unconditional)

Average marginal effects                        Number of obs      =      2050

Expression   : Linear prediction, predict()
dy/dx w.r.t. : pad680

------------------------------------------------------------------------------
             |             Linearized
             |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
pad680       |
      female |
       male  |  -.2389814   .0194172   -12.31   0.000    -.2799481   -.1980148
     female  |  -.1116136   .0293859    -3.80   0.001    -.1736123   -.0496149
------------------------------------------------------------------------------
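
The two simple slopes come directly from the regression coefficients: the slope for males is the coefficient for pad680, and the slope for females adds the interaction coefficient. The sketch below first checks this by hand and then with lincom, which assumes the svy: regress results are still the active estimates (margins without the post option leaves them in place).

```stata
* slope for females = pad680 coefficient + interaction coefficient
display -.2389814 + .1273678
* the same linear combination with a standard error and test
lincom pad680 + 1.female#c.pad680
```

Both should reproduce the female slope of about -.1116 shown in the margins output.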

margins female, at(pad680=(0(200)1400)) vsquish vce(unconditional)

Adjusted predictions                            Number of obs      =      2050

Expression   : Linear prediction, predict()
1._at        : pad680          =           0
2._at        : pad680          =         200
3._at        : pad680          =         400
4._at        : pad680          =         600
5._at        : pad680          =         800
6._at        : pad680          =        1000
7._at        : pad680          =        1200
8._at        : pad680          =        1400

------------------------------------------------------------------------------
             |             Linearized
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _at#female |
     1#male  |   234.2159   9.710639    24.12   0.000     213.7282    254.7035
   1#female  |   160.3196   12.25815    13.08   0.000     134.4572     186.182
     2#male  |   186.4196   6.921151    26.93   0.000     171.8173     201.022
   2#female  |   137.9969   7.310548    18.88   0.000      122.573    153.4208
     3#male  |   138.6233   5.627637    24.63   0.000       126.75    150.4966
   3#female  |   115.6742   5.070343    22.81   0.000     104.9767    126.3717
     4#male  |   90.82704   6.752807    13.45   0.000     76.57986    105.0742
   4#female  |   93.35145   8.188708    11.40   0.000     76.07479    110.6281
     5#male  |   43.03076   9.470621     4.54   0.000     23.04949    63.01202
   5#female  |   71.02874    13.3223     5.33   0.000     42.92113    99.13634
     6#male  |  -4.765528   12.80418    -0.37   0.714    -31.77999    22.24893
   6#female  |   48.70602   18.89431     2.58   0.020     8.842508    88.56953
     7#male  |  -52.56181   16.38181    -3.21   0.005     -87.1244   -17.99922
   7#female  |    26.3833   24.60871     1.07   0.299    -25.53653    78.30313
     8#male  |  -100.3581   20.07342    -5.00   0.000    -142.7093   -58.00688
   8#female  |   4.060579   30.38526     0.13   0.895    -60.04672    68.16788
------------------------------------------------------------------------------
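
Each adjusted prediction is just the regression equation evaluated at the stated values. As a sketch, using coefficients copied from the regression output above:

```stata
* males at pad680 = 200: constant + 200 * slope for males
display 234.2159 + 200 * (-.2389814)
* females at pad680 = 600: constant + female coefficient + 600 * slope for females
display 234.2159 - 73.89628 + 600 * (-.2389814 + .1273678)
```

The results should be close to the margins reported for 2#male (186.4196) and 4#female (93.35145).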

marginsplot

  Variables that uniquely identify margins: pad680 female

In the example below, we get the difference between the predicted values for males and females at eight values of pad680. We then use the marginsplot command to graph these differences. This graph shows that the difference between males and females is larger at the extreme values of pad680, as is the variability around those estimates.

Notice that in the output from the margins command above, the predicted value for 4#female is 93.35145 and the predicted value for 4#male is 90.82704. The difference, 93.35145 - 90.82704 = 2.52441, is the value of dy/dx on line 4 of the output below. The corresponding p-value is .808, so the difference is not statistically significant. You can see this in the graph above, where the points for males and females appear to be on top of each other.

margins, dydx(female) at(pad680=(0(200)1400)) vsquish vce(unconditional)

Conditional marginal effects                    Number of obs      =      2050

Expression   : Linear prediction, predict()
dy/dx w.r.t. : 1.female
1._at        : pad680          =           0
2._at        : pad680          =         200
3._at        : pad680          =         400
4._at        : pad680          =         600
5._at        : pad680          =         800
6._at        : pad680          =        1000
7._at        : pad680          =        1200
8._at        : pad680          =        1400

------------------------------------------------------------------------------
             |             Linearized
             |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.female     |
         _at |
          1  |  -73.89628   18.78819    -3.93   0.001    -113.5359   -34.25665
          2  |  -48.42271   10.55029    -4.59   0.000    -70.68189   -26.16354
          3  |  -22.94915   5.339066    -4.30   0.000    -34.21359    -11.6847
          4  |   2.524416   10.22678     0.25   0.808    -19.05221    24.10104
          5  |   27.99798   18.42697     1.52   0.147    -10.87952    66.87548
          6  |   53.47154   27.08143     1.97   0.065    -3.665269    110.6084
          7  |   78.94511   35.86278     2.20   0.042     3.281267     154.609
          8  |   104.4187   44.69629     2.34   0.032     10.11775    198.7196
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
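
The female-minus-male difference at any value of pad680 is the female coefficient plus the interaction coefficient times pad680, so the table above can be reproduced by hand. The sketch below uses coefficients copied from the regression output and also solves for the value of pad680 at which the estimated difference is zero.

```stata
* difference (female - male) at pad680 = 600
display -73.89628 + .1273678 * 600
* value of pad680 at which the estimated difference crosses zero
display 73.89628 / .1273678
```

The first result should match the 2.524416 on line 4 of the output, and the second should be roughly 580, which is why the difference changes sign between pad680 = 400 and pad680 = 600.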

marginsplot, yline(0)

  Variables that uniquely identify margins: pad680

marginsplot, recast(line) recastci(rarea) yline(0)

  Variables that uniquely identify margins: pad680

If we use a categorical predictor variable in our model that has more than two levels, we may want to run comparisons between each of the levels. In the example below, we use the educational attainment variable, dmdeduc2, as a predictor. We use the contrast command to determine whether dmdeduc2, taken as a whole, is a statistically significant predictor of our outcome variable, pad630. Next, we use the pwcompare command to conduct all pairwise comparisons. We use the mcompare(sidak) option to adjust the p-values for the multiple comparisons, and we use the pveffects option to have the test statistics and p-values shown in the output. Other methods that we could have specified in the mcompare option are bonferroni and scheffe.

svy: regress pad630 i.dmdeduc2 ridageyr
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        14                  Number of obs      =      1706
Number of PSUs     =        31                  Population size    =  77856183
                                                Design df          =        17
                                                F(   5,     13)    =     14.67
                                                Prob > F           =    0.0001
                                                R-squared          =    0.0541

--------------------------------------------------------------------------------------------
                           |             Linearized
                    pad630 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------------------+----------------------------------------------------------------
                  dmdeduc2 |
            no hs diploma  |   5.590284   17.44227     0.32   0.752    -31.20968    42.39025
           hs grad or GED  |  -23.40642   18.92372    -1.24   0.233    -63.33198    16.51915
some college or AA degree  |  -36.62121   23.10453    -1.59   0.131    -85.36752    12.12509
    college grad or above  |  -69.28669   22.12513    -3.13   0.006    -115.9666   -22.60674
                           |
                  ridageyr |  -1.167769   .2339256    -4.99   0.000    -1.661309   -.6742287
                     _cons |   235.0531   18.73331    12.55   0.000     195.5293    274.5769
--------------------------------------------------------------------------------------------

contrast dmdeduc2

Contrasts of marginal linear predictions

                                                Design df          =        17

Margins      : asbalanced

------------------------------------------------
             |         df           F        P>F
-------------+----------------------------------
    dmdeduc2 |          4       14.46     0.0001
      Design |         17
------------------------------------------------
Note: F statistics are adjusted for the survey
      design.

pwcompare dmdeduc2, mcompare(sidak) cformat(%3.1f) pveffects

Pairwise comparisons of marginal linear predictions

                                                Design df          =        17

Margins      : asbalanced

---------------------------
             |    Number of
             |  Comparisons
-------------+-------------
    dmdeduc2 |           10
---------------------------

--------------------------------------------------------------------------------------------
                                                    |                              Sidak
                                                    |   Contrast   Std. Err.      t    P>|t|
----------------------------------------------------+---------------------------------------
                                           dmdeduc2 |
              no hs diploma vs less than 9th grade  |        5.6       17.4     0.32   1.000
             hs grad or GED vs less than 9th grade  |      -23.4       18.9    -1.24   0.929
  some college or AA degree vs less than 9th grade  |      -36.6       23.1    -1.59   0.756
      college grad or above vs less than 9th grade  |      -69.3       22.1    -3.13   0.059
                   hs grad or GED vs no hs diploma  |      -29.0        9.8    -2.95   0.087
        some college or AA degree vs no hs diploma  |      -42.2       10.0    -4.23   0.006
            college grad or above vs no hs diploma  |      -74.9       10.2    -7.31   0.000
       some college or AA degree vs hs grad or GED  |      -13.2       13.9    -0.95   0.988
           college grad or above vs hs grad or GED  |      -45.9       13.2    -3.47   0.029
college grad or above vs some college or AA degree  |      -32.7       12.2    -2.68   0.147
--------------------------------------------------------------------------------------------
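
To see where the adjusted p-values come from, the sketch below applies the Sidak formula, 1 - (1 - p)^m, to one comparison, where p is the unadjusted two-sided p-value and m is the number of comparisons. It assumes the unadjusted test is a t test on the 17 design degrees of freedom, as in the regression table; the coefficient and standard error are copied from that table.

```stata
* number of pairwise comparisons among 5 education levels
display comb(5, 2)
* unadjusted p-value, college grad or above vs less than 9th grade
display 2 * ttail(17, abs(-69.28669 / 22.12513))
* Sidak-adjusted p-value with m = 10 comparisons
display 1 - (1 - 2 * ttail(17, abs(-69.28669 / 22.12513)))^10
```

The unadjusted value should be close to the 0.006 in the regression table and the adjusted value close to the 0.059 in the pwcompare table. Small discrepancies are expected because the inputs are rounded.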

Logistic regression

Let’s run a logistic regression. We will use the variable paq665 (do you do any moderate-intensity sports, fitness or recreational activities that cause a small increase in breathing or heart rate at least 10 minutes continually?) as the outcome variable. We will also limit the analysis to respondents who are older than 20. Before we run our logistic regression, let’s look at the distribution of the outcome variable and at the crosstabulation of the outcome variable with our categorical predictor variable. The purpose of the first descriptive analysis is to assess the proportion and number of cases in each level of the outcome variable. If there were a very small proportion of cases in one of the levels, we might encounter estimation problems with the logistic regression. Likewise, if the crosstabulation showed that one or more cells had very few unweighted observations or no observations at all, we might want to recode the predictor variable into fewer categories so that there were a sufficient number of observations in each cell. The results below do not look problematic, although there are relatively few observations that are coded as 1 on our outcome variable and 5 on our predictor variable (the yes/poor cell, with 25 unweighted observations). This might increase the standard error around that point estimate.

svy, subpop(if ridageyr > 20): tab paq665, ///
cell obs count cellwidth(12) format(%12.2g)
(running tabulate on estimation sample)

Number of strata   =        14                 Number of obs      =       9755
Number of PSUs     =        31                 Population size    =  306516207
                                               Subpop. no. of obs =       5443
                                               Subpop. size       =  218554854
                                               Design df          =         17

----------------------------------------------------
Moderate  |
recreatio |
nal       |
activitie |
s         |        count   proportions           obs
----------+-----------------------------------------
       no |    116709772           .53          3195
      yes |    101845081           .47          2248
          | 
    Total |    218554854             1          5443
----------------------------------------------------
  Key:  count         =  weighted counts
        proportions   =  cell proportions
        obs           =  number of observations

svy, subpop(if ridageyr > 20): tab hsd010 paq665, ///
cell obs count cellwidth(12) format(%12.2g)
(running tabulate on estimation sample)

Number of strata   =        14                 Number of obs      =       8915
Number of PSUs     =        31                 Population size    =  277379812
                                               Subpop. no. of obs =       4603
                                               Subpop. size       =  189418459
                                               Design df          =         17

----------------------------------------------------
General   |
health    |     Moderate recreational activities    
condition |           no           yes         Total
----------+-----------------------------------------
 excellen |      9401577      13365343      22766920
          |          .05          .071           .12
          |          223           234           457
          | 
 very goo |     27412706      34747263      62159969
          |          .14           .18           .33
          |          591           650          1241
          | 
     good |     41493331      31885029      73378359
          |          .22           .17           .39
          |         1092           753          1845
          | 
     fair |     18274770       8295293      26570064
          |         .096          .044           .14
          |          634           256           890
          | 
     poor |      3896015        647132       4543147
          |         .021         .0034          .024
          |          145            25           170
          | 
    Total |    100478399      88940060     189418459
          |          .53           .47             1
          |         2685          1918          4603
----------------------------------------------------
  Key:  weighted counts
        cell proportions
        number of observations

  Pearson:
    Uncorrected   chi2(4)         =  386.5375
    Design-based  F(2.76, 46.90)  =   35.0850     P = 0.0000
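
If a crosstabulation like the one above had shown a sparse or empty cell, one option is to collapse the predictor into fewer categories before fitting the model. The following is only a minimal sketch, assuming hsd010 is coded 1 (excellent) through 5 (poor), as the order of the value labels suggests. The grouping, the new variable name hsd010_3 and its labels are hypothetical, and any other codes (for example, codes for refused or don't know) would need to be set to missing first.

```stata
* Collapse the five health categories into three (hypothetical grouping);
* hsd010_3 is an illustrative name, not a variable in the data set
recode hsd010 (1 2 = 1 "excellent/very good") ///
              (3   = 2 "good")                ///
              (4 5 = 3 "fair/poor"), generate(hsd010_3)

* Re-check the unweighted cell counts for the collapsed variable
svy, subpop(if ridageyr > 20): tab hsd010_3 paq665, obs
```

Collapsing categories trades detail for stability, so it is worth doing only when the sparse cells are actually causing estimation problems.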

Now let’s run the logistic regression. We can use the contrast command to get the multi-degree-of-freedom test of hsd010.

svy, subpop(if ridageyr > 20): logit paq665 ib3.hsd010 c.ridageyr
(running logit on estimation sample)

Survey: Logistic regression

Number of strata   =        14                 Number of obs      =       8915
Number of PSUs     =        31                 Population size    =  277379812
                                               Subpop. no. of obs =       4603
                                               Subpop. size       =  189418459
                                               Design df          =         17
                                               F(   5,     13)    =      37.01
                                               Prob > F           =     0.0000

------------------------------------------------------------------------------
             |             Linearized
      paq665 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hsd010 |
  excellent  |   .5986804   .1120919     5.34   0.000     .3621872    .8351736
  very good  |   .4914449   .1141339     4.31   0.000     .2506434    .7322465
       fair  |  -.4945664   .1127125    -4.39   0.000    -.7323691   -.2567638
       poor  |  -1.450551   .2048429    -7.08   0.000    -1.882732    -1.01837
             |
    ridageyr |  -.0109676     .00292    -3.76   0.002    -.0171282    -.004807
       _cons |   .2653286   .1382701     1.92   0.072    -.0263958     .557053
------------------------------------------------------------------------------

contrast hsd010

Contrasts of marginal linear predictions

                                                Design df          =        17

Margins      : asbalanced

------------------------------------------------
             |         df           F        P>F
-------------+----------------------------------
      hsd010 |          4       44.44     0.0000
      Design |         17
------------------------------------------------
Note: F statistics are adjusted for the survey
      design.
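
The coefficients above are on the log-odds scale. As a sketch of two common follow-ups (not part of the original analysis), the same model can be refit with the or option to display odds ratios, and testparm can be used as an alternative way to request a joint test of the hsd010 indicators. After svy estimation, testparm also reports a design-adjusted F test by default.

```stata
* Refit the same model, reporting odds ratios instead of log-odds coefficients
svy, subpop(if ridageyr > 20): logit paq665 ib3.hsd010 c.ridageyr, or

* Joint (multi-degree-of-freedom) Wald test of the hsd010 indicators
testparm i.hsd010
```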

Let’s include a categorical by continuous interaction in the model. We will use the contrast command to get the multi-degree-of-freedom test of the interaction (that is, to determine whether the interaction as a whole is statistically significant), the margins command to get the predicted probabilities, and the marginsplot command to graph the interaction.

svy, subpop(if ridageyr > 20): logit paq665 ib3.hsd010##c.ridageyr
(running logit on estimation sample)

Survey: Logistic regression

Number of strata   =        14                 Number of obs      =       8915
Number of PSUs     =        31                 Population size    =  277379812
                                               Subpop. no. of obs =       4603
                                               Subpop. size       =  189418459
                                               Design df          =         17
                                               F(   9,      9)    =      50.07
                                               Prob > F           =     0.0000

-----------------------------------------------------------------------------------
                  |             Linearized
           paq665 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
           hsd010 |
       excellent  |   .3858144   .5191089     0.74   0.467    -.7094096    1.481038
       very good  |    .652864   .2254827     2.90   0.010     .1771372    1.128591
            fair  |   .3499552   .3042182     1.15   0.266    -.2918891    .9917995
            poor  |  -.9774887   1.318543    -0.74   0.469    -3.759372    1.804395
                  |
         ridageyr |  -.0083106    .004617    -1.80   0.090    -.0180516    .0014303
                  |
hsd010#c.ridageyr |
       excellent  |   .0046206   .0101254     0.46   0.654    -.0167421    .0259833
       very good  |  -.0033669   .0042601    -0.79   0.440    -.0123549    .0056211
            fair  |  -.0170309   .0064368    -2.65   0.017    -.0306114   -.0034504
            poor  |  -.0090097   .0265987    -0.34   0.739    -.0651279    .0471086
                  |
            _cons |   .1376637   .2029975     0.68   0.507    -.2906237    .5659511
-----------------------------------------------------------------------------------

contrast hsd010#c.ridageyr

Contrasts of marginal linear predictions

                                                Design df          =        17

Margins      : asbalanced

-----------------------------------------------------
                  |         df           F        P>F
------------------+----------------------------------
hsd010#c.ridageyr |          4        4.73     0.0126
           Design |         17
-----------------------------------------------------
Note: F statistics are adjusted for the survey
      design.

margins hsd010, subpop(if ridageyr > 20) at(ridageyr=(20(10)80)) ///
vsquish vce(unconditional)

Adjusted predictions                            Number of obs      =      8915
                                                Subpop. no. of obs =      4603

Expression   : Pr(paq665), predict()
1._at        : ridageyr        =          20
2._at        : ridageyr        =          30
3._at        : ridageyr        =          40
4._at        : ridageyr        =          50
5._at        : ridageyr        =          60
6._at        : ridageyr        =          70
7._at        : ridageyr        =          80

------------------------------------------------------------------------------
             |             Linearized
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _at#hsd010 |
1#excellent  |   .6105625   .0627257     9.73   0.000     .4782229    .7429022
1#very good  |   .6357529    .030173    21.07   0.000     .5720935    .6994123
     1#good  |   .4928633   .0299032    16.48   0.000     .4297731    .5559535
     1#fair  |   .4951972   .0374002    13.24   0.000     .4162897    .5741046
     1#poor  |   .2339337   .1366427     1.71   0.105    -.0543571    .5222245
2#excellent  |   .6017536   .0467599    12.87   0.000     .5030987    .7004084
2#very good  |   .6083073   .0251426    24.19   0.000      .555261    .6613536
     2#good  |   .4721152   .0212358    22.23   0.000     .4273116    .5169188
     2#fair  |   .4322622   .0318406    13.58   0.000     .3650845    .4994399
     2#poor  |   .2043323   .0853799     2.39   0.029     .0241965    .3844681
3#excellent  |   .5928782   .0329296    18.00   0.000     .5234028    .6623537
3#very good  |   .5801593   .0224851    25.80   0.000     .5327198    .6275988
     3#good  |    .451463   .0165591    27.26   0.000     .4165264    .4863996
     3#fair  |   .3714403   .0279114    13.31   0.000     .3125525    .4303282
     3#poor  |   .1776082    .043438     4.09   0.001     .0859621    .2692544
4#excellent  |   .5839418   .0257247    22.70   0.000     .5296674    .6382163
4#very good  |     .55148   .0236447    23.32   0.000      .501594    .6013659
     4#good  |   .4309767   .0189088    22.79   0.000     .3910827    .4708706
     4#fair  |   .3144367   .0261895    12.01   0.000     .2591816    .3696917
     4#poor  |   .1537041   .0180881     8.50   0.000     .1155416    .1918666
5#excellent  |   .5749499   .0307861    18.68   0.000     .5099969    .6399029
5#very good  |   .5224542   .0284286    18.38   0.000     .4624752    .5824332
     5#good  |   .4107239   .0261597    15.70   0.000     .3555316    .4659161
     5#fair  |   .2625274   .0261464    10.04   0.000     .2073633    .3176914
     5#poor  |   .1324989    .030077     4.41   0.000     .0690418    .1959559
6#excellent  |   .5659081   .0443458    12.76   0.000     .4723467    .6594696
6#very good  |   .4932759   .0354001    13.93   0.000     .4185883    .5679635
     6#good  |   .3907692   .0350558    11.15   0.000     .3168079    .4647305
     6#fair  |   .2164816   .0266714     8.12   0.000     .1602098    .2727533
     6#poor  |   .1138257   .0489862     2.32   0.033     .0104739    .2171774
7#excellent  |   .5568223   .0611297     9.11   0.000     .4278498    .6857948
7#very good  |   .4641434   .0433376    10.71   0.000     .3727091    .5555776
     7#good  |   .3711734   .0442524     8.39   0.000      .277809    .4645377
     7#fair  |   .1765782    .026898     6.56   0.000     .1198284    .2333281
     7#poor  |   .0974883   .0635355     1.53   0.143    -.0365598    .2315365
------------------------------------------------------------------------------
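
Because the model contains only hsd010, ridageyr and their interaction, each margin above is simply the inverse logit of the linear prediction at the stated age and health category. As an illustrative check (typing in the rounded coefficients from the logit output by hand), two of the margins at age 20 can be reproduced with display:

```stata
* Reference category (good) at age 20: _cons + 20 * ridageyr
display invlogit(.1376637 + 20*(-.0083106))

* fair at age 20: add the fair main effect and the fair-by-age interaction
display invlogit(.1376637 + .3499552 + 20*(-.0083106 - .0170309))
```

The results should be close to the margins reported for 1#good (.4928633) and 1#fair (.4951972), with any small differences due to rounding of the displayed coefficients.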

marginsplot

  Variables that uniquely identify margins: ridageyr hsd010
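
With five health categories and seven ages, the default graph can be crowded by overlapping confidence intervals. As a sketch of two display variations (these option choices are a suggestion, not part of the original workshop), the plot can be redrawn while the margins results are still in memory:

```stata
* Redraw the plot without confidence intervals to reduce clutter
marginsplot, noci

* Or draw the confidence intervals as shaded bands around each line
marginsplot, recast(line) recastci(rarea)
```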

For our last example, we will run a logistic regression model and use the estat gof command to get the Archer-Lemeshow test (which is a modification of the Hosmer-Lemeshow test that can be used with survey data). The numerator degrees of freedom are calculated as g - 1, and the denominator degrees of freedom are calculated as f - g + 2, where f is the number of sampled clusters minus the number of strata (here 31 - 14 = 17, the design degrees of freedom) and g is the number of groups (10 is the default). For the following example, we have 10 - 1 = 9 numerator degrees of freedom and 17 - 10 + 2 = 9 denominator degrees of freedom.

As of Stata 13.1, estat gof cannot be used after a model that was fit with the subpop() option, so the model below is fit without a subpopulation.

svy: logit paq665 i.female##c.pad680 
(running logit on estimation sample)

Survey: Logistic regression

Number of strata   =        14                 Number of obs      =       6742
Number of PSUs     =        31                 Population size    =  255503942
                                               Design df          =         17
                                               F(   3,     15)    =       1.16
                                               Prob > F           =     0.3593

---------------------------------------------------------------------------------
                |             Linearized
         paq665 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
         female |
        female  |   .1277199   .1276729     1.00   0.331    -.1416463    .3970861
         pad680 |   .0000791   .0002271     0.35   0.732       -.0004    .0005582
                |
female#c.pad680 |
        female  |  -.0003267   .0002309    -1.41   0.175    -.0008139    .0001605
                |
          _cons |  -.1236965   .1246061    -0.99   0.335    -.3865923    .1391994
---------------------------------------------------------------------------------

estat gof

Logistic model for paq665, goodness-of-fit test

                       F(9,9) =         0.89
                     Prob > F =         0.5656
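
The degrees of freedom and p-value in this output can be checked by hand. The sketch below assumes it is run directly after the commands above, so that the svy: logit results are still in e(). The F statistic is typed in as displayed (0.89), so the p-value should be close to, but not necessarily identical to, the reported 0.5656.

```stata
* Design degrees of freedom: number of PSUs minus number of strata
display e(N_psu) - e(N_strata)

* Upper-tail probability of F = 0.89 with 9 and 9 degrees of freedom
display Ftail(9, 9, 0.89)
```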

For more information on using the NHANES data sets

There are helpful resources for learning how to analyze the NHANES data sets correctly. One is a listserv at http://www.cdc.gov/nchs/nhanes/nhanes_listserv.htm . There are also online tutorials at http://www.cdc.gov/nchs/tutorials/index.htm .

References

Applied Survey Data Analysis by Steven G. Heeringa, Brady T. West, and Patricia A. Berglund

A survey of survey statistics: What is done and can be done in Stata by Frauke Kreuter and Richard Valliant, The Stata Journal (2007), Volume 7, Number 1, pages 1-21.

Goodness-of-fit test for a logistic regression model fitted using survey sample data by Kellie J. Archer and Stanley Lemeshow, The Stata Journal (2006), Volume 6, Number 1, pages 97-105.

Analysis of Health Surveys by Edward L. Korn and Barry I. Graubard

Sampling of Populations: Methods and Applications, Fourth Edition by Paul Levy and Stanley Lemeshow

Analysis of Survey Data Edited by R. L. Chambers and C. J. Skinner

Sampling Techniques, Third Edition by William G. Cochran

Stata 13 Manual: Survey Data