The purpose of this workshop is to explore some issues in the analysis of survey data using Stata 17. Before we begin, you will want to be sure that your copy of Stata is up-to-date. To do this, please type
update all
in the Stata command window and follow any instructions given. These updates include not only fixes to known bugs but also new features that may be useful.
Before we begin looking at examples in Stata, we will quickly review some basic issues and concepts in survey data analysis.
NOTE: Most of the commands in this seminar will work with Stata versions 12-16.
Why do we need survey data analysis software?
Regular procedures in statistical software (software not designed for survey data) analyze data as if the data had been collected using simple random sampling. For experimental and quasi-experimental designs, this is exactly what we want. However, very few surveys use a simple random sample to collect data. Not only is it nearly impossible to do so, but it is not as efficient (either financially or statistically) as other sampling methods.

When any sampling method other than simple random sampling is used, we usually need survey data analysis software to take into account the differences between the design that was used to collect the data and simple random sampling. This is because the sampling design affects both the calculation of the point estimates and the standard errors of those estimates. If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, both the point estimates and their standard errors will likely be calculated incorrectly. The sampling weight affects the calculation of the point estimates, and the stratification and/or clustering affects the calculation of the standard errors. Ignoring the clustering will likely lead to standard errors that are underestimated, possibly leading to results that seem to be statistically significant when, in fact, they are not.

The difference in point estimates and standard errors obtained using non-survey software versus survey software with the design properly specified will vary from data set to data set, and even between analyses using the same data set. While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be.
Sampling designs
Most people do not conduct their own surveys. Rather, they use survey data that some agency or company collected and made available to the public. The documentation must be read carefully to find out what kind of sampling design was used to collect the data. This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs. Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.
Below are some common features of many sampling designs.
Sampling weights: There are several types of weights that can be associated with a survey. Perhaps the most common is the sampling weight. A sampling weight is a probability weight that has had one or more adjustments made to it. Both a sampling weight and a probability weight are used to weight the sample back to the population from which the sample was drawn. By definition, a probability weight is the inverse of the probability of being included in the sample due to the sampling design (except for a certainty PSU, see below). The probability weight, called a pweight in Stata, is calculated as N/n, where N = the number of elements in the population and n = the number of elements in the sample. For example, if a population has 10 elements and 3 are sampled at random with replacement, then the probability weight would be 10/3 = 3.33. In a two-stage design, the probability weight is calculated as (1/f1)(1/f2); that is, the inverse of the sampling fraction for the first stage multiplied by the inverse of the sampling fraction for the second stage. Under many sampling plans, the sum of the probability weights will equal the population total.
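The weight arithmetic above can be sketched in a few lines. (Python is used here rather than Stata purely to make the arithmetic concrete; the district/student counts in the two-stage example are hypothetical.)

```python
# Probability weight as the inverse of the selection probability.
# Single-stage: N elements in the population, n sampled.
N, n = 10, 3
pweight = N / n                 # each sampled element "represents" N/n population elements
print(round(pweight, 2))        # → 3.33

# Two-stage design: the weight is the product of the inverses of the
# stage-specific sampling fractions, (1/f1) * (1/f2).
f1 = 5 / 50                     # e.g., 5 of 50 districts sampled (hypothetical)
f2 = 20 / 400                   # e.g., 20 of 400 students sampled per district (hypothetical)
pweight_2stage = (1 / f1) * (1 / f2)
print(pweight_2stage)           # → 200.0

# Under many sampling plans, the weights sum to the population total:
print(round(pweight * n, 9) == N)   # → True
```

This is why summing the sampling weights in a public-use file often gives (approximately) the size of the target population.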
While many textbooks will end their discussion of probability weights here, this definition does not fully describe the sampling weights that are included with actual survey data sets. Rather, the sampling weight, which is sometimes called a “final weight,” starts with the inverse of the sampling fraction, but then incorporates several other values, such as corrections for unit non-response, errors in the sampling frame (sometimes called non-coverage), calibration and trimming. Because these other values are included in the probability weight that is included with the data set, it is often inadvisable to modify the sampling weights, such as trying to standardize them for a particular variable, e.g., age.
Strata: Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race or SES. Each element in the population must belong to one, and only one, stratum. Once the strata have been defined, samples are taken from each stratum as if it were independent of all of the other strata. For example, if a sample is to be stratified on gender, men and women would be sampled independently of one another. This means that the probability weights for men will likely be different from the probability weights for the women. In most cases, you need to have two or more PSUs in each stratum. The purpose of stratification is to reduce the standard error of the estimates, and stratification works most effectively when the variance of the dependent variable is smaller within the strata than in the sample as a whole.
PSU: This is the primary sampling unit. This is the first unit that is sampled in the design. For example, school districts from California may be sampled and then schools within districts may be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU. One does not need to use the same sampling method at all levels of sampling. For example, probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts). In the case of a simple random sample, the PSUs and the elementary units are the same. In general, accounting for the clustering in the data (i.e., using the PSUs), will increase the standard errors of the point estimates. Conversely, ignoring the PSUs will tend to yield standard errors that are too small, leading to false positives when doing significance tests.
FPC: This is the finite population correction. This is used when the sampling fraction (the number of elements or respondents sampled relative to the population) becomes large. The FPC is used in the calculation of the standard error of the estimate. If the value of the FPC is close to 1, it will have little impact and can be safely ignored. In some survey data analysis programs, such as SUDAAN, this information will be needed if you specify that the data were collected without replacement (see below for a definition of "without replacement"). The formula for calculating the FPC is ((N - n)/(N - 1))^(1/2), where N is the number of elements in the population and n is the number of elements in the sample. To see the impact of the FPC for samples of various proportions, suppose that you had a population of 10,000 elements.
Sample size (n)      FPC
              1   1.0000
             10    .9995
            100    .9950
            500    .9747
           1000    .9487
           5000    .7071
           9000    .3162
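The FPC column above can be reproduced directly from the formula (a quick check in Python; the population size and sample sizes are the ones from the table):

```python
import math

def fpc(N, n):
    """Finite population correction: ((N - n) / (N - 1)) ** (1/2)."""
    return math.sqrt((N - n) / (N - 1))

N = 10_000
for n in (1, 10, 100, 500, 1000, 5000, 9000):
    print(f"{n:>5}  {fpc(N, n):.4f}")
# prints:
#     1  1.0000
#    10  0.9995
#   100  0.9950
#   500  0.9747
#  1000  0.9487
#  5000  0.7071
#  9000  0.3162
```

As the table shows, the correction only matters once the sample is a sizable fraction of the population; at n = 100 out of 10,000 it is still essentially 1.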
Replicate weights: Replicate weights are a series of weight variables that are used to correct the standard errors for the sampling plan. They serve the same function as the PSU and strata variables (which are used in Taylor series linearization) to correct the standard errors of the estimates for the sampling design. Many public use data sets are now being released with replicate weights instead of PSUs and strata in an effort to more securely protect the identity of the respondents. In theory, the same standard errors will be obtained using either the PSU and strata or the replicate weights. There are different ways of creating replicate weights; the method used is determined by the sampling plan. The most common are balanced repeated replication and jackknife replicate weights. You will need to read the documentation for the survey data set carefully to learn what type of replicate weight is included in the data set; specifying the wrong type of replicate weight will likely lead to incorrect standard errors. For more information on replicate weights, please see Stata Library: Replicate Weights and Appendix D of the WesVar Manual by Westat, Inc. Several statistical packages, including Stata, SAS, R, Mplus, SUDAAN and WesVar, allow the use of replicate weights.
Consequences of not using the design elements
Sampling design elements include the sampling weights, post-stratification weights (if provided), PSUs, strata, and replicate weights. Rarely are all of these elements included in a single public-use data set. However, ignoring the design elements that are included can often lead to inaccurate point estimates and/or inaccurate standard errors.
Sampling with and without replacement
Most samples collected in the real world are collected "without replacement". This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample. Many of the calculations change depending on whether a sample is collected with or without replacement. Hence, programs like SUDAAN request that you specify whether a survey sampling design was implemented with or without replacement, and an FPC is used if sampling without replacement was used, even if the value of the FPC is very close to one.
Examples
For the examples in this workshop, we will use the data set from NHANES (the National Health and Nutrition Examination Survey) that comes with Stata.
Reading the documentation
The first step in analyzing any survey data set is to read the documentation. With many of the public use data sets, the documentation can be quite extensive and sometimes even intimidating. Instead of trying to read the documentation "cover to cover", there are some parts you will want to focus on. First, read the Introduction. This is usually an "easy read" and will orient you to the survey. There is usually a section or chapter called something like "Sample Design and Analysis Guidelines", "Variance Estimation", etc. This is the part that tells you about the design elements included with the survey and how to use them. Some even give example code. If multiple sampling weights have been included in the data set, there will be some instruction about when to use which one. If there is a section or chapter on missing data or imputation, please read that. This will tell you how missing data were handled. You should also read any documentation regarding the specific variables that you intend to use. As we will see a little later on, we will need to look at the documentation to get the value labels for the variables. This is especially important because some of the values are actually missing data codes, and you need to do something so that Stata doesn't treat those as valid values (or you will get some very "interesting" means, totals, etc.).
Getting NHANES data into Stata
For the examples in this workshop, we will be using a version of the NHANES II dataset that comes with Stata. However, for real analyses, you need to download data from the NHANES website. The following commands can be used to open the NHANES data in Stata and save them in Stata format. I have also sorted the data before saving them (because I will merge the files), but this is not technically necessary.
* demographics
clear
fdause "D:\Data\Seminars\Applied Survey Stata 17\demo_g.xpt"
sort seqn
save "D:\Data\Seminars\Applied Survey Stata 17\demo_g", replace
The do-file that imports the data, merges the files and recodes the variables can be found here. The data file created by this do-file is not the data file used in this workshop. It is linked here to provide an example of how NHANES data can be imported into Stata, cleaned and merged to create a working data file.
The svyset command
Before we can start our analyses, we need to issue the svyset command. The svyset command tells Stata about the design elements in the survey. Once this command has been issued, all you need to do for your analyses is use the svy: prefix before each estimation command. (The 2011-2012 NHANES data discussed above were released with a sampling weight (wtint2yr), a PSU variable (sdmvpsu) and a strata variable (sdmvstra).) The NHANES II data set used in our examples includes a sampling weight (finalwgt), a PSU variable (psuid) and a strata variable (stratid), and we will use these in our svyset command. The svyset command looks like this:
use https://www.stata-press.com/data/r17/nhanes2f, clear
svyset psuid [pweight=finalwgt], strata(stratid) singleunit(centered)

Sampling weights: finalwgt
             VCE: linearized
     Single unit: centered
        Strata 1: stratid
 Sampling unit 1: psuid
           FPC 1:
The singleunit option was added in Stata 10. This option allows for different ways of handling a single PSU in a stratum. There are usually two or more PSUs in each stratum. If there is only one PSU in a stratum, due to missing data or a subpopulation specification, for example, the variance cannot be estimated for that stratum. If you use the default option, missing, then you will get no standard errors when Stata encounters a single PSU in a stratum. There are three other options. One is certainty, meaning that singleton PSUs are treated as certainty PSUs; certainty PSUs are PSUs that were selected into the sample with a probability of 1 (in other words, these PSUs were certain to be in the sample) and do not contribute to the standard error. The scaled option gives a scaled version of the certainty option. The scaling factor comes from using the average of the variances from the strata with multiple sampling units for each stratum with one PSU. The centered option centers strata with one sampling unit at the grand mean instead of the stratum mean.
Now that we have issued the svyset command, we can use the svydescribe command to get information on the strata and PSUs. It is a good idea to use this command to learn about the strata and PSUs in the dataset.
svydescribe

Survey: Describing stage 1 sampling units

Sampling weights: finalwgt
             VCE: linearized
     Single unit: centered
        Strata 1: stratid
 Sampling unit 1: psuid
           FPC 1:

                                    Number of obs per unit
 Stratum   # units     # obs       Min      Mean       Max
----------------------------------------------------------
       1         2       380       165     190.0       215
       2         2       185        67      92.5       118
       3         2       348       149     174.0       199
       4         2       460       229     230.0       231
       5         2       252       105     126.0       147
       6         2       298       131     149.0       167
       7         2       476       206     238.0       270
       8         2       337       158     168.5       179
       9         2       243       100     121.5       143
      10         2       262       119     131.0       143
      11         2       275       120     137.5       155
      12         2       314       144     157.0       170
      13         2       342       154     171.0       188
      14         2       405       200     202.5       205
      15         2       380       189     190.0       191
      16         2       336       159     168.0       177
      17         2       393       180     196.5       213
      18         2       359       144     179.5       215
      20         2       283       125     141.5       158
      21         2       213       102     106.5       111
      22         2       301       128     150.5       173
      23         2       340       158     170.0       182
      24         2       434       202     217.0       232
      25         2       254       115     127.0       139
      26         2       261       129     130.5       132
      27         2       283       139     141.5       144
      28         2       298       135     149.0       163
      29         2       502       215     251.0       287
      30         2       365       166     182.5       199
      31         2       308       143     154.0       165
      32         2       450       211     225.0       239
----------------------------------------------------------
      31        62    10,337        67     166.7       287
The output above tells us that there are 31 strata with two PSUs in each, for a total of 62 PSUs. There are a different number of observations in each PSU and in each stratum: a minimum of 67 observations, a maximum of 287 observations and an average of 166.7 observations per PSU. The design degrees of freedom are calculated as the number of PSUs minus the number of strata. For this data set, that is 62 - 31 = 31.
Descriptive statistics
We will start by calculating descriptive statistics for some of the continuous variables. We can use the svy: mean command to get the mean of a continuous variable, and we can follow that command with the estat sd command to get the standard deviation; adding the var option to estat sd gives the variance instead.
svy: mean age
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =      10,337
Number of PSUs   = 62            Population size = 117,023,659
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
         age |   42.23732    .3034412      41.61844    42.85619
--------------------------------------------------------------

estat sd

-------------------------------------
             |       Mean   Std. dev.
-------------+-----------------------
         age |   42.23732    15.50095
-------------------------------------

estat sd, var

-------------------------------------
             |       Mean    Variance
-------------+-----------------------
         age |   42.23732    240.2795
-------------------------------------

svy: mean copper
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =       9,118
Number of PSUs   = 62            Population size = 103,505,700
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
      copper |   124.7232    .6657517      123.3654     126.081
--------------------------------------------------------------

svy: mean hct   // hematocrit
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =      10,337
Number of PSUs   = 62            Population size = 117,023,659
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
         hct |   41.98352    .0762331      41.82804      42.139
--------------------------------------------------------------
Notice that a different number of observations was used in each analysis. This is important, because if you use these three variables in the same call to svy: mean, you will get different estimates of the means for each of the variables. This is because copper has missing values. There are 9,118 observations that have valid (i.e., non-missing) values on all three variables, and only those observations are used in the calculation of the means when all of the variables are included in the same call to svy: mean.
svy: mean age copper hct
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =       9,118
Number of PSUs   = 62            Population size = 103,505,700
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
         age |   42.32525    .3239445      41.66457    42.98594
      copper |   124.7232    .6657517      123.3654     126.081
         hct |   41.97078    .0843598      41.79873    42.14283
--------------------------------------------------------------
Let’s get some descriptive statistics on some of the binary variables, such as female and rural.
svy: mean female
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =      10,337
Number of PSUs   = 62            Population size = 117,023,659
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
      female |   .5204215     .005773      .5086474    .5321956
--------------------------------------------------------------

estat sd

-------------------------------------
             |       Mean   Std. dev.
-------------+-----------------------
      female |   .5204215     .499607
-------------------------------------

svy: mean rural
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =      10,337
Number of PSUs   = 62            Population size = 117,023,659
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
       rural |   .3174388    .0212654      .2740677    .3608099
--------------------------------------------------------------

estat sd

-------------------------------------
             |       Mean   Std. dev.
-------------+-----------------------
       rural |   .3174388    .4655023
-------------------------------------
Taking the mean of a variable that is coded 0/1 gives the proportion of 1s. The output above indicates that approximately 52% of the observations in our data set are from females; 31.74% of respondents live in a rural area.
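Why the mean of a 0/1 variable is a proportion can be seen with a few lines of arithmetic. (A toy Python sketch with made-up data and weights, just to illustrate; Stata's svy: mean does the weighted version with the sampling weights from svyset.)

```python
# Unweighted: the mean of a 0/1 variable is the proportion of 1s.
female = [1, 0, 1, 1, 0]              # toy 0/1 data (hypothetical)
print(sum(female) / len(female))      # → 0.6

# With sampling weights, the estimate is the weighted mean,
# sum(w*x) / sum(w), i.e., the weighted proportion of 1s.
w = [2.0, 1.0, 1.0, 3.0, 1.0]         # hypothetical pweights
wmean = sum(wi * xi for wi, xi in zip(w, female)) / sum(w)
print(wmean)                          # → 0.75
```

The weighted version is why the svy: mean estimate (52.04% female) is a population proportion rather than a simple sample proportion.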
Of course, the svy: tab command can also be used with binary variables. The proportion of .5204 matches the .5204215 that we found with the svy: mean command.
svy: tab female
(running tabulate on estimation sample)

Number of strata = 31            Number of obs   =      10,337
Number of PSUs   = 62            Population size = 117,023,659
                                 Design df       =          31

----------------------
   Female | proportion
----------+-----------
     Male |      .4796
   Female |      .5204
          |
    Total |          1
----------------------
Key: proportion = Cell proportion
By default, svy: tab gives proportions. If you would prefer to see the actual counts, you will need to use the count option. Oftentimes, the counts are too large for the display space in the table, so other options, such as cellwidth and format, need to be used to display the counts. The missing option is often useful if the variable has any missing values (the variable female does not).
svy: tab female, missing count cellwidth(15) format(%15.2g)
(running tabulate on estimation sample)

Number of strata = 31            Number of obs   =      10,337
Number of PSUs   = 62            Population size = 117,023,659
                                 Design df       =          31

---------------------------
   Female |           count
----------+----------------
     Male |        56122035
   Female |        60901624
          |
    Total |       117023659
---------------------------
Key: count = Weighted count

svy: tab female rural, missing count cellwidth(15) format(%15.2g)
(running tabulate on estimation sample)

Number of strata = 31            Number of obs   =      10,337
Number of PSUs   = 62            Population size = 117,023,659
                                 Design df       =          31

-------------------------------------------------------------
          |                    Rural
   Female |           Urban           Rural           Total
----------+--------------------------------------------------
     Male |        37327257        18794778        56122035
   Female |        42548556        18353068        60901624
          |
    Total |        79875813        37147846       117023659
-------------------------------------------------------------
Key: Weighted count

Pearson:
  Uncorrected   chi2(1)  = 13.3910
  Design-based  F(1, 31) = 12.3092
                P        =  0.0014
Using the col option with svy: tab will give the column proportions. As you can see, the values in the column “Total” are the same as those from svy: tab female.
svy: tab female rural, col
(running tabulate on estimation sample)

Number of strata = 31            Number of obs   =      10,337
Number of PSUs   = 62            Population size = 117,023,659
                                 Design df       =          31

-------------------------------
          |       Rural
   Female |  Urban  Rural  Total
----------+--------------------
     Male |  .4673  .5059  .4796
   Female |  .5327  .4941  .5204
          |
    Total |      1      1      1
-------------------------------
Key: Column proportion

Pearson:
  Uncorrected   chi2(1)  = 13.3910
  Design-based  F(1, 31) = 12.3092
                P        =  0.0014
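The column proportions are just the weighted counts from the count table divided by the weighted column totals, as a quick arithmetic check (in Python, using the counts Stata reported) confirms:

```python
# Weighted counts from the svy: tab female rural, count output
counts = {
    ("Male", "Urban"): 37_327_257, ("Male", "Rural"): 18_794_778,
    ("Female", "Urban"): 42_548_556, ("Female", "Rural"): 18_353_068,
}

# Column proportion = weighted cell count / weighted column total
for area in ("Urban", "Rural"):
    col_total = counts[("Male", area)] + counts[("Female", area)]
    for sex in ("Male", "Female"):
        print(area, sex, round(counts[(sex, area)] / col_total, 4))
# prints:
# Urban Male 0.4673
# Urban Female 0.5327
# Rural Male 0.5059
# Rural Female 0.4941
```

These match the Urban and Rural columns in the table above; the Total column likewise reproduces the marginal proportions from svy: tab female.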
Of course, the svy: proportion command can also be used to get proportions.
svy: proportion female
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata = 31            Number of obs   =      10,337
Number of PSUs   = 62            Population size = 117,023,659
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized            Logit
             | Proportion   std. err.     [95% conf. interval]
-------------+------------------------------------------------
female       |
        Male |   .4795785     .005773      .4678179    .4913618
      Female |   .5204215     .005773      .5086382    .5321821
--------------------------------------------------------------
Using the estat post-estimation command after the svy: mean command also allows you to get the design effects, misspecification effects, unweighted and weighted sample sizes, or the coefficient of variation.
The Deff and the Deft are types of design effects, which tell you about the efficiency of your sample. The Deff is a ratio of two variances. In the numerator we have the variance estimate from the current sample (including all of its design elements), and in the denominator we have the variance from a hypothetical sample of the same size drawn as an SRS. In other words, the Deff tells you how efficient your sample is compared to an SRS of equal size. If the Deff is less than 1, your sample is more efficient than SRS; usually the Deff is greater than 1. The Deft is the ratio of two standard error estimates. Again, the numerator is the standard error estimate from the current sample. The denominator is a hypothetical SRS (with replacement) standard error from a sample of the same size as the current sample.
You can also use the meff and the meft option to get the misspecification effects. Misspecification effects are a ratio of the variance estimate from the current analysis to a hypothetical variance estimated from a misspecified model. Please see the Stata documentation for more details on how these are calculated.
The coefficient of variation is the ratio of the standard error to the mean, multiplied by 100% (see pages 38-40 of the Stata 17 svy manual). It is an indication of the variability relative to the mean in the population and is not affected by the unit of measurement of the variable.
svy: mean age
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =      10,337
Number of PSUs   = 62            Population size = 117,023,659
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
         age |   42.23732    .3034412      41.61844    42.85619
--------------------------------------------------------------

estat effects

----------------------------------------------------------
             |             Linearized
             |       Mean   std. err.      DEFF      DEFT
-------------+--------------------------------------------
         age |   42.23732    .3034412    3.9612   1.99028
----------------------------------------------------------

estat effects, deff deft meff meft

------------------------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.      DEFF      DEFT      MEFF      MEFT
-------------+----------------------------------------------------------------
         age |   42.23732    .3034412    3.9612   1.99028     3.211   1.79193
------------------------------------------------------------------------------

estat size

----------------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.           Obs           Size
-------------+--------------------------------------------------------
         age |   42.23732    .3034412        10,337    117,023,659
----------------------------------------------------------------------

estat cv

------------------------------------------------
             |             Linearized
             |       Mean   std. err.     CV (%)
-------------+----------------------------------
         age |   42.23732    .3034412     .71842
------------------------------------------------

display (.3034412/42.23732)*100
.71841963
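Because the Deff is a ratio of variances and the Deft a ratio of standard errors, Deft is simply the square root of Deff; and the CV is just the standard error over the mean, times 100. A quick arithmetic check on the numbers Stata printed (Python used here only for the arithmetic):

```python
import math

deff = 3.9612           # from estat effects
deft = math.sqrt(deff)  # Deft is the square root of the Deff
print(round(deft, 5))   # → 1.99028, matching Stata's DEFT

mean, se = 42.23732, .3034412
cv = se / mean * 100    # coefficient of variation, in percent
print(round(cv, 5))     # → 0.71842, matching estat cv
```

This kind of hand check is a useful sanity test when comparing output across packages that report Deff versus Deft.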
Analysis of subpopulations
Before we continue with our descriptive statistics, we should pause to discuss the analysis of subpopulations. The analysis of subpopulations is one place where survey data and experimental data are quite different. If you have data from an experiment (or quasi-experiment), and you want to analyze the responses from, say, just the women, or just people over age 50, you can just delete the unwanted cases from the data set or use the by: prefix. Complex survey data are different. With survey data, you (almost) never get to delete any cases from the data set, even if you will never use them in any of your analyses. Because of the way the -by- prefix works, you usually don't use it with survey data either. (For an excellent discussion of conditional versus unconditional subdomain analyses, please see this presentation by Brady West, Ph.D.) Instead, Stata has provided two options that allow you to correctly analyze subpopulations of your survey data. These options are subpop and over. The subpop option is sort of like deleting unwanted cases (without really deleting them, of course), and the over option is very similar to -by- processing in that the output contains the estimates for each group. We will start with some examples of the subpop option.
First, however, let’s take a second to see why deleting cases from a survey data set can be so problematic. If the data set is subset (meaning that observations not to be included in the subpopulation are deleted from the data set), the standard errors of the estimates cannot be calculated correctly. When the subpopulation option is used, only the cases defined by the subpopulation are used in the calculation of the estimate, but all cases are used in the calculation of the standard errors. For more information on this issue, please see Sampling Techniques, Third Edition by William G. Cochran (1977) and Small Area Estimation by J. N. K. Rao (2003). Also, if you look in the Stata 17 svy manual, you will find an entire section (pages 67-72) dedicated to the analysis of subpopulations. The formulas for using both if and subpop are given, along with an explanation of how they are different. If you look at the help for any svy: command, you will see the same warning:
Warning: Use of if or in restrictions will not produce correct variance estimates for subpopulations in many cases. To compute estimates for subpopulations, use the subpop() option. The full specification for subpop() is subpop([varname] [if])
The subpop option on the svy: prefix without the if is used with binary variables. (Technically, all cases coded as not 0 and not missing are part of the subpopulation; this means that if your subpopulation variable has values of 1 and 2, all of the observations will be included in the subpopulation unless you use a different syntax.) You may need to create a binary variable (using the generate command) in which all of the observations to be included in the subpopulation are coded as 1 and all other observations are coded as 0. For our example, we will use the variable female for our subpopulation variable, so that only females will be included in the calculation of the point estimate.
svy: mean age
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =      10,337
Number of PSUs   = 62            Population size = 117,023,659
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
         age |   42.23732    .3034412      41.61844    42.85619
--------------------------------------------------------------

svy, subpop(female): mean age
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =      10,337
Number of PSUs   = 62            Population size = 117,023,659
                                 Subpop. no. obs =       5,428
                                 Subpop. size    =  60,901,624
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
         age |    42.5498    .3717231      41.79166    43.30793
--------------------------------------------------------------
To get the mean for the males, we can specify the subpopulation as the variable female not equal to 1 (meaning the observations that are coded 0).
svy, subpop(if female != 1): mean age
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =      10,337
Number of PSUs   = 62            Population size = 117,023,659
                                 Subpop. no. obs =       4,909
                                 Subpop. size    =  56,122,035
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
         age |   41.89822     .339303      41.20621    42.59024
--------------------------------------------------------------
We can also use the over option, which will give the results for each level of the categorical variable listed. The over option is available only for svy: mean, svy: proportion, svy: ratio and svy: total.
svy, over(female): mean age
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
c.age@female |
        Male |   41.89822     .339303     41.20621    42.59024
      Female |    42.5498    .3717231     41.79166    43.30793
--------------------------------------------------------------
We can include more than one variable with the over option; this will give us results for every combination of the variables listed.
svy, over(health female): mean copper
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs   =       9,116
Number of PSUs   = 62    Population size = 103,479,298
                         Design df       =          31

------------------------------------------------------------------------
                       |             Linearized
                       |       Mean   std. err.     [95% conf. interval]
-----------------------+------------------------------------------------
c.copper@health#female |
             Poor#Male |   125.1899    1.800517     121.5177    128.8621
           Poor#Female |   140.3425    2.196568     135.8626    144.8224
             Fair#Male |   118.1246    1.288914     115.4958    120.7534
           Fair#Female |   137.6154    1.273191     135.0187    140.2121
          Average#Male |   113.7033    .9745251     111.7158    115.6909
        Average#Female |   140.9529    1.780904     137.3207    144.5851
             Good#Male |   108.7135    .6959349     107.2941    110.1328
           Good#Female |   137.5177    1.374578     134.7142    140.3211
        Excellent#Male |   106.3076    .5571644     105.1712    107.4439
      Excellent#Female |   132.7433    1.569208     129.5429    135.9438
------------------------------------------------------------------------
We can also use the subpop option and the over option together. In the example below, we get the mean of copper for females in each combination of race and region (12 levels).
svy, subpop(female): mean copper, over(race region)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs   =       9,677
Number of PSUs   = 62    Population size = 109,683,584
                         Subpop. no. obs =       4,768
                         Subpop. size    =  53,561,549
                         Design df       =          31

----------------------------------------------------------------------
                     |             Linearized
                     |       Mean   std. err.     [95% conf. interval]
---------------------+------------------------------------------------
c.copper@race#region |
            White#NE |   133.8487    2.644552     128.4551    139.2423
            White#MW |   135.3951    1.914987     131.4895    139.3008
             White#S |   135.4202    1.677846     131.9982    138.8422
             White#W |   140.4265    2.395216     135.5414    145.3115
            Black#NE |   149.2318    3.040079     143.0315    155.4321
            Black#MW |   146.0234    3.755621     138.3638     153.683
             Black#S |   156.3648    4.922914     146.3245    166.4052
             Black#W |   145.7466    1.902169     141.8671    149.6261
            Other#NE |   103.1149    5.877161     91.12837    115.1015
            Other#MW |   123.9078    5.761588     112.1569    135.6586
             Other#S |   136.9562    19.24526     97.70521    176.2071
             Other#W |   120.8002    1.899352     116.9265     124.674
----------------------------------------------------------------------
In some situations, no standard error or confidence intervals will be given. This may be because there are very few observations at that level. To see if this is indeed the cause, we can use the estat size command. The unweighted number of cases is given in the column titled “Obs”, and the weighted (or estimated subpopulation size) is given in the column titled “Size”.
estat size

----------------------------------------------------------------------
             |             Linearized
        Over |       Mean   std. err.          Obs          Size
-------------+--------------------------------------------------------
   c.copper@ |
 race#region |
    White#NE |   133.8487    2.644552          952    10,987,657
    White#MW |   135.3951    1.914987        1,142    11,954,684
     White#S |   135.4202    1.677846        1,099    12,006,761
     White#W |   140.4265    2.395216        1,025    12,587,312
    Black#NE |   149.2318    3.040079           52       576,302
    Black#MW |   146.0234    3.755621          120     1,059,794
     Black#S |   156.3648    4.922914          239     2,572,543
     Black#W |   145.7466    1.902169           51       540,474
    Other#NE |   103.1149    5.877161            6        89,269
    Other#MW |   123.9078    5.761588            8       101,797
     Other#S |   136.9562    19.24526           11       100,557
     Other#W |   120.8002    1.899352           63       984,399
----------------------------------------------------------------------
The list command can be used to investigate combinations of subpopulation variables where the N is small.
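For example, to inspect the observations in the smallest cell above, something like the following could be used (the numeric codes 3 for Other and 1 for NE are assumptions for illustration; check your own coding with label list before relying on them):

```stata
* inspect the few observations for Other race in the Northeast
list race region copper finalwgt if race == 3 & region == 1 & !missing(copper)
```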
The lincom command can be used to make comparisons between subpopulations. In the examples below, we will compare the mean iron level between males and females. We can use the display command to see how the point estimate in the output of the lincom command is calculated. The value of using the lincom command is that the standard error of the point estimate is also calculated, as well as the test statistic, p-value and 95% confidence interval. Please see page 121 of the Stata 17 svy manual (first example in the section on survey postestimation) for more information on using lincom after the svy, subpop(): mean command.
svy, over(female): mean iron
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

---------------------------------------------------------------
              |             Linearized
              |       Mean   std. err.     [95% conf. interval]
--------------+------------------------------------------------
c.iron@female |
         Male |   104.8145    .5595657     103.6733    105.9558
       Female |   97.17922    .6698075     95.81314     98.5453
---------------------------------------------------------------
We will need to add the coeflegend option to get Stata’s labels for the values we need.
svy, over(female): mean iron, coeflegend
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

-------------------------------------------------------------------------------
              |       Mean   Legend
--------------+----------------------------------------------------------------
c.iron@female |
         Male |   104.8145   _b[c.iron@0bn.female]
       Female |   97.17922   _b[c.iron@1.female]
-------------------------------------------------------------------------------

lincom _b[c.iron@0bn.female] - _b[c.iron@1.female]

 ( 1)  c.iron@0bn.female - c.iron@1.female = 0

------------------------------------------------------------------------------
        Mean | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   7.635327   .7467989    10.22   0.000     6.112221    9.158434
------------------------------------------------------------------------------

display 104.8145 - 97.17922  // a little rounding error
7.63528
In the next example, we compare different regions with respect to mean of copper.
svy, over(region): mean copper
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs   =       9,118
Number of PSUs   = 62    Population size = 103,505,700
                         Design df       =          31

-----------------------------------------------------------------
                |             Linearized
                |       Mean   std. err.     [95% conf. interval]
----------------+------------------------------------------------
c.copper@region |
             NE |   122.5009      1.6006     119.2364    125.7653
             MW |   123.5889    1.381478     120.7714    126.4064
              S |   128.0201    1.162307     125.6495    130.3906
              W |   124.3029    1.337444     121.5752    127.0306
-----------------------------------------------------------------

svy, over(region): mean copper, coeflegend
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs   =       9,118
Number of PSUs   = 62    Population size = 103,505,700
                         Design df       =          31

---------------------------------------------------------------------------------
                |       Mean   Legend
----------------+----------------------------------------------------------------
c.copper@region |
             NE |   122.5009   _b[c.copper@1bn.region]
             MW |   123.5889   _b[c.copper@2.region]
              S |   128.0201   _b[c.copper@3.region]
              W |   124.3029   _b[c.copper@4.region]
---------------------------------------------------------------------------------

lincom _b[c.copper@4.region] - _b[c.copper@1bn.region]  // West - NE

 ( 1)  - c.copper@1bn.region + c.copper@4.region = 0

------------------------------------------------------------------------------
        Mean | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   1.802041   2.085828     0.86   0.394    -2.452033    6.056115
------------------------------------------------------------------------------

lincom _b[c.copper@4.region] - _b[c.copper@2.region]  // West - MW

 ( 1)  - c.copper@2.region + c.copper@4.region = 0

------------------------------------------------------------------------------
        Mean | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   .7140072   1.922821     0.37   0.713    -3.207612    4.635626
------------------------------------------------------------------------------
We can use the svy: total command to get totals. The values in the output can be very large, so we can use the estimates table command with a numeric format to display them clearly. We can also use the matlist command.
svy: total diabetes
(running total on estimation sample)

Survey: Total estimation

Number of strata = 31    Number of obs   =      10,335
Number of PSUs   = 62    Population size = 116,997,257
                         Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |      Total   std. err.     [95% conf. interval]
-------------+------------------------------------------------
    diabetes |    4011281    233292.8      3535477     4487085
--------------------------------------------------------------

estimates table, b(%15.2f) se(%13.2f)

--------------------------------
    Variable |      Active
-------------+------------------
    diabetes |      4011281.00
             |       233292.85
--------------------------------
Legend: b/se

svy: total diabetes
(running total on estimation sample)

Survey: Total estimation

Number of strata = 31    Number of obs   =      10,335
Number of PSUs   = 62    Population size = 116,997,257
                         Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |      Total   std. err.     [95% conf. interval]
-------------+------------------------------------------------
    diabetes |    4011281    233292.8      3535477     4487085
--------------------------------------------------------------

matlist e(b), format(%15.2f)

             |        diabetes
-------------+-----------------
          y1 |      4011281.00

svy, over(female): total diabetes
(running total on estimation sample)

Survey: Total estimation

Number of strata = 31    Number of obs   =      10,335
Number of PSUs   = 62    Population size = 116,997,257
                         Design df       =          31

-------------------------------------------------------------------
                  |             Linearized
                  |      Total   std. err.     [95% conf. interval]
------------------+------------------------------------------------
c.diabetes@female |
             Male |    1636958    159586.3      1311480     1962436
           Female |    2374323    167874.1      2031941     2716705
-------------------------------------------------------------------

estimates table, b(%15.2f) se(%13.2f)

--------------------------------
    Variable |      Active
-------------+------------------
 c.diabetes@ |
      female |
        Male |      1636958.00
             |       159586.34
      Female |      2374323.00
             |       167874.15
--------------------------------
Legend: b/se

Graphing with continuous variables
We can also get some descriptive graphs of our variables. For a continuous variable, you may want a histogram. However, the histogram command will only accept a frequency weight, which, by definition, can have only integer values. A suggestion by Heeringa, West and Berglund (2010, pages 121-122) is to simply use the integer part of the sampling weight. We can create a frequency weight from our sampling weight using the generate command with the int function.
gen int_finalwgt = int(finalwgt)

histogram copper [fw = int_finalwgt], bin(20)

histogram age [fw = int_finalwgt], bin(20) normal

We can make box plots and use the sampling weight that is provided in the data set.
graph box hct [pw = finalwgt]

graph box hct [pw = finalwgt], by(female)

The line in the box plot represents the median, not the mean.
svy, over(female): mean hct
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
c.hct@female |
        Male |   44.21309    .0730382     44.06412    44.36205
      Female |   39.92893    .0857013     39.75414    40.10372
--------------------------------------------------------------

estat sd

-------------------------------------
        Over |       Mean   Std. dev.
-------------+-----------------------
c.hct@female |
        Male |   44.21309    2.891824
      Female |   39.92893     3.02211
-------------------------------------

There is no survey command in Stata that will give us the median, but we can use a “hack” and specify the sampling weight as an analytic weight. (Please see https://www.stata.com/support/faqs/statistics/percentiles-for-survey-data/ for more information.)
bysort female: summarize hct [aw = finalwgt], detail

--------------------------------------------------------------------------
-> female = Male

                       Hematocrit (%)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           37           20.2
 5%         39.5           22.5
10%         40.7             23       Obs             4,909
25%         42.2           27.2       Sum of wgt.  56122035

50%         44.2                      Mean          44.21309
                        Largest       Std. dev.     2.906201
75%           46           57.2
90%         47.7           57.5       Variance      8.446005
95%         48.7           57.5       Skewness     -.2544314
99%           51           60.7       Kurtosis      5.152102

--------------------------------------------------------------------------
-> female = Female

                       Hematocrit (%)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           33           23.2
 5%           35           24.5
10%         36.2             26       Obs             5,428
25%           38             27       Sum of wgt.  60901624

50%           40                      Mean          39.92893
                        Largest       Std. dev.     3.008737
75%           42             54
90%         43.7           57.7       Variance      9.052499
95%           45           58.5       Skewness      .0781784
99%           47           59.5       Kurtosis      4.524747

Relationships between two continuous variables
Let’s look at some bivariate relationships. As of Stata 17, there is not a convenience command to get correlations (but there is in SUDAAN 11).
To get a weighted correlation in Stata, we can use the svy: sem command with the standardize option. Clearly, this is not an intuitive way to get a correlation coefficient! The idea here is that a correlation is a standardized covariance estimate. One advantage to calculating the correlation coefficient this way is that the output contains not only the correlation coefficient and the associated p-value, but also the standard error. In the output shown below, we are interested only in the last line. The correlation coefficient is 0.0281931, and its standard error is 0.009628.
svy: sem (<- diabetes copper), standardize
(running sem on estimation sample)

Survey: Structural equation model

Number of obs    = 9,116
Number of strata = 31    Population size = 103,479,298
Number of PSUs   = 62    Design df       =          31

-------------------------------------------------------------------------------------
                     |             Linearized
        Standardized | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
---------------------+----------------------------------------------------------------
      mean(diabetes) |   .1887943   .0059407    31.78   0.000     .1766782    .2009105
        mean(copper) |   3.667598   .0656603    55.86   0.000     3.533683    3.801513
---------------------+----------------------------------------------------------------
       var(diabetes) |          1          .                             .           .
         var(copper) |          1          .                             .           .
---------------------+----------------------------------------------------------------
cov(diabetes,copper) |   .0281931     .009628    2.93   0.006     .0085565    .0478296
-------------------------------------------------------------------------------------

svy: sem (<- bpsystol bpdiast), standardize
(running sem on estimation sample)

Survey: Structural equation model

Number of obs    = 10,337
Number of strata = 31    Population size = 117,023,659
Number of PSUs   = 62    Design df       =          31

--------------------------------------------------------------------------------------
                      |             Linearized
         Standardized | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
----------------------+----------------------------------------------------------------
       mean(bpsystol) |   5.930388   .0823673    72.00   0.000     5.762398    6.098377
        mean(bpdiast) |   6.331351   .1097803    57.67   0.000     6.107453    6.555249
----------------------+----------------------------------------------------------------
        var(bpsystol) |          1          .                             .           .
         var(bpdiast) |          1          .                             .           .
----------------------+----------------------------------------------------------------
cov(bpsystol,bpdiast) |    .695335   .0073571    94.51   0.000     .6803301    .7103398
--------------------------------------------------------------------------------------

The graphical version of a correlation is a scatterplot, which we show below. We will include the fit line.
svy: mean bpsystol bpdiast
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
    bpsystol |   126.9476    .6040197     125.7157    128.1795
     bpdiast |   81.02173    .5086788     79.98427    82.05918
--------------------------------------------------------------

twoway (scatter bpsystol bpdiast [pw = finalwgt]) (lfit bpsystol bpdiast [pw = finalwgt]), ///
    title("Systolic Blood Pressure" "v. Diastolic Blood Pressure")

Descriptive statistics with categorical variables
Let’s get some descriptive statistics with categorical variables. We can use the svy: tabulate and svy: proportion commands. We will use the variable region, which has four levels. When used with no options, the svy: tab command will give us the same information as the svy: proportion command.
svy: tab region
(running tabulate on estimation sample)

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

----------------------
Region    | proportion
----------+-----------
       NE |      .2063
       MW |      .2492
        S |      .2656
        W |      .2789
          |
    Total |          1
----------------------
Key: proportion = Cell proportion

svy: proportion region
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

--------------------------------------------------------------
             |             Linearized         Logit
             | Proportion   std. err.     [95% conf. interval]
-------------+------------------------------------------------
region       |
          NE |   .2063353    .0056671     .1950164    .2181331
          MW |    .249163    .0061898     .2367536    .2619997
           S |   .2655949    .0103955     .2449398    .2873289
           W |   .2789068    .0113228     .2564096    .3025747
--------------------------------------------------------------

Let’s use some options with the svy: tab command so that we can see the estimated number of people in each category.
svy: tab region, count cellwidth(12) format(%12.2g)
(running tabulate on estimation sample)

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

------------------------
Region    |        count
----------+-------------
       NE |     24146107
       MW |     29157970
        S |     31080884
        W |     32638698
          |
    Total |    117023659
------------------------
Key: count = Weighted count

There are many options that can be used with svy: tab. Please see the Stata help file for the svy: tabulate command for a complete listing and description of each option. Only five items can be displayed at once, and the ci option counts as two items.
svy: tab region, cell count obs cellwidth(12) format(%12.2g)
(running tabulate on estimation sample)

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

----------------------------------------------------
Region    |        count   proportion          obs
----------+-----------------------------------------
       NE |     24146107          .21         2086
       MW |     29157970          .25         2773
        S |     31080884          .27         2853
        W |     32638698          .28         2625
          |
    Total |    117023659            1        10337
----------------------------------------------------
Key: count = Weighted count
     proportion = Cell proportion
     obs = Number of observations

svy: tab region, count se cellwidth(15) format(%15.2g)
(running tabulate on estimation sample)

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

--------------------------------------------
Region    |           count              se
----------+---------------------------------
       NE |        24146107          569293
       MW |        29157970          566351
        S |        31080884         1508325
        W |        32638698         1714567
          |
    Total |       117023659
--------------------------------------------
Key: count = Weighted count
     se = Linearized standard error of weighted count

svy: tab region, count deff deft cv cellwidth(12) format(%12.2g)
(running tabulate on estimation sample)

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

--------------------------------------------
Region    |           count              se
----------+---------------------------------
       NE |        24146107          569293
       MW |        29157970          566351
        S |        31080884         1508325
        W |        32638698         1714567
          |
    Total |       117023659
--------------------------------------------
Key: count = Weighted count
     se = Linearized standard error of weighted count

Chi-square tests are provided by default when svy: tab is issued with two variables. You will usually want to use the design-based test.
The proportion of observations in each cell can be obtained using either the svy: tab or the svy: proportion command. If you want to compare the proportions within two cells, you can use the lincom command.
svy: tab agegrp female, cell obs count cellwidth(12) format(%12.2g)
(running tabulate on estimation sample)

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

----------------------------------------------------
          |               Female
Age group |        Male       Female        Total
----------+-----------------------------------------
    20–29 |    15918998     16938699     32857697
          |         .14          .14          .28
          |        1116         1204         2320
          |
    30–39 |    11777252     12150570     23927822
          |          .1           .1           .2
          |         770          851         1621
          |
    40–49 |     9538701     10152242     19690943
          |        .082         .087          .17
          |         610          660         1270
          |
    50–59 |     9210429     10337703     19548132
          |        .079         .088          .17
          |         601          688         1289
          |
    60–69 |     7326861      8261777     15588638
          |        .063         .071          .13
          |        1365         1487         2852
          |
      70+ |     2349794      3060633      5410427
          |         .02         .026         .046
          |         447          538          985
          |
    Total |    56122035     60901624    117023659
          |         .48          .52            1
          |        4909         5428        10337
----------------------------------------------------
Key: Weighted count
     Cell proportion
     Number of observations

Pearson:
  Uncorrected    chi2(5)         =    6.7106
  Design-based   F(3.91, 121.15) =    1.2445     P = 0.2960

svy: proportion agegrp, over(female)
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

---------------------------------------------------------------
              |             Linearized         Logit
              | Proportion   std. err.     [95% conf. interval]
--------------+------------------------------------------------
agegrp@female |
  20–29 Male  |   .2836497    .0095365     .2646088    .3034953
 20–29 Female |   .2781321    .0085494     .2610357    .2958999
  30–39 Male  |   .2098508    .0086259     .1927991    .2279846
 30–39 Female |   .1995114    .0059814     .1875917    .2119909
  40–49 Male  |   .1699636    .0059311     .1582069    .1824046
 40–49 Female |    .166699    .0051213     .1565142    .1774073
  50–59 Male  |   .1641143    .0070436     .1502492    .1789894
 50–59 Female |   .1697443    .0068912     .1561486    .1842653
  60–69 Male  |   .1305523    .0051017      .120495    .1413141
 60–69 Female |   .1356577    .0040929     .1275246    .1442239
    70+ Male  |   .0418694    .0029631     .0362262    .0483475
   70+ Female |   .0502554    .0038064     .0430362    .0586114
---------------------------------------------------------------

svy: proportion agegrp, over(female) coeflegend
(running proportion on estimation sample)

Survey: Proportion estimation

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

-------------------------------------------------------------------------------
              | Proportion   Legend
--------------+----------------------------------------------------------------
agegrp@female |
  20–29 Male  |   .2836497   _b[1bn.agegrp@0bn.female]
 20–29 Female |   .2781321   _b[1bn.agegrp@1.female]
  30–39 Male  |   .2098508   _b[2.agegrp@0bn.female]
 30–39 Female |   .1995114   _b[2.agegrp@1.female]
  40–49 Male  |   .1699636   _b[3.agegrp@0bn.female]
 40–49 Female |    .166699   _b[3.agegrp@1.female]
  50–59 Male  |   .1641143   _b[4.agegrp@0bn.female]
 50–59 Female |   .1697443   _b[4.agegrp@1.female]
  60–69 Male  |   .1305523   _b[5.agegrp@0bn.female]
 60–69 Female |   .1356577   _b[5.agegrp@1.female]
    70+ Male  |   .0418694   _b[6.agegrp@0bn.female]
   70+ Female |   .0502554   _b[6.agegrp@1.female]
-------------------------------------------------------------------------------

lincom _b[1bn.agegrp@0bn.female] - _b[1bn.agegrp@1.female]  // (20-29 male) - (20-29 female)

 ( 1)  1bn.agegrp@0bn.female - 1bn.agegrp@1.female = 0

------------------------------------------------------------------------------
  Proportion | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   .0055176   .0108797     0.51   0.616    -.0166718    .0277069
------------------------------------------------------------------------------

display .2836497 - .2781321
.0055176

Graphs with categorical variables
Let’s create a bar graph of the variable female. This graph will show the percent of observations that are female and male. To do this, we will need to create a new variable, which we will call male; it will be the opposite of female.
gen male = !female

graph bar (mean) female male [pw = finalwgt], percentages bargap(7)

We can also graph the mean of the variable copper by each level of race.
svy: mean copper, over(race)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs   =       9,118
Number of PSUs   = 62    Population size = 103,505,700
                         Design df       =          31

---------------------------------------------------------------
              |             Linearized
              |       Mean   std. err.     [95% conf. interval]
--------------+------------------------------------------------
c.copper@race |
        White |   123.7236    .6372616     122.4239    125.0233
        Black |   139.0427    2.222075     134.5107    143.5746
        Other |    110.714    1.075104     108.5213    112.9067
---------------------------------------------------------------

graph hbar copper [pw = finalwgt], over(race, gap(*2)) ///
    title("Copper by race")

Finally, we will graph the mean of hct for each level of region.
svy: mean hct, over(region)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31    Number of obs   =      10,337
Number of PSUs   = 62    Population size = 117,023,659
                         Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
c.hct@region |
          NE |    42.0485    .1610878     41.71996    42.37704
          MW |   41.85997    .1606116      41.5324    42.18754
           S |   42.05992    .1506741     41.75262    42.36722
           W |   41.97306    .1361827     41.69532    42.25081
--------------------------------------------------------------

graph bar hct [pw = finalwgt], over(region, label(angle(45))) ///
    title("Hematocrit by region")

OLS regression
Now that we have descriptive statistics on our variables, we may want to run some inferential statistics, such as OLS regression or logistic regression. As before, we simply need to use the -svy- prefix before our regression commands.
Please note that the following analyses are shown only as examples of how to do these analyses in Stata. There was no attempt to create substantively meaningful models. Rather, the variables were chosen for illustrative purposes only. We do not recommend that researchers create their models this way.
We will start with an OLS regression with one categorical predictor (female). This is equivalent to a t-test. We will use the margins command to get the means for each group. The vce(unconditional) option is used on all of the margins commands. This is the entry for this option from the Stata documentation:
vce(unconditional) specifies that the covariates that are not fixed be treated in a way that accounts for their having been sampled. The VCE is estimated using the linearization method. This method allows for heteroskedasticity or other violations of distributional assumptions and allows for correlation among the observations in the same manner as vce(robust) and vce(cluster …), which may have been specified with the estimation command. This method also accounts for complex survey designs if the data are svyset. See Obtaining margins with survey data and representative samples. When you use complex survey data, this method requires that the linearized variance estimation method be used for the model. See [SVY] svy postestimation for an example of margins with replication-based methods.
The coeflegend option is also used; this is a handy option that can be used with many Stata estimation commands and gives the labels that Stata assigns to these values. You use these labels to refer to the values when using commands such as lincom.
NOTE: If a regression is run on a subpopulation, the subpopulation must also be specified as an option on the margins command if the output from margins is to apply to that subpopulation.
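As a sketch of this point (the model itself is hypothetical, chosen only to illustrate the syntax; the variables female, copper and age are from our data set):

```stata
* fit the model on the subpopulation of females ...
svy, subpop(female): regress copper c.age

* ... and restrict margins to the same subpopulation
margins, subpop(female) vce(unconditional)
```

Without the subpop() option on margins, the predictions would be averaged over the full estimation sample rather than over the subpopulation used to fit the model.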
svy: regress copper i.female
(running regress on estimation sample)

Survey: Linear regression

Number of strata = 31    Number of obs   =       9,118
Number of PSUs   = 62    Population size = 103,505,700
                         Design df       =          31
                         F(1, 31)        =      613.75
                         Prob > F        =      0.0000
                         R-squared       =      0.1489

------------------------------------------------------------------------------
             |             Linearized
      copper | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |
     Female  |   26.26294   1.060099    24.77   0.000     24.10085    28.42502
       _cons |   111.1328   .5685644   195.46   0.000     109.9732    112.2924
------------------------------------------------------------------------------

margins female, vce(unconditional)

Adjusted predictions

Number of strata = 31    Number of obs   =       9,118
Number of PSUs   = 62    Population size = 103,505,700
                         Design df       =          31

Expression: Linear prediction, predict()

------------------------------------------------------------------------------
             |             Linearized
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |
       Male  |   111.1328   .5685644   195.46   0.000     109.9732    112.2924
     Female  |   137.3958   1.051281   130.69   0.000     135.2517    139.5399
------------------------------------------------------------------------------

display 137.3958 - 111.1328
26.263

margins female, vce(unconditional) coeflegend post  // the post option is necessary when using the coeflegend option

Adjusted predictions

Number of strata = 31    Number of obs   =       9,118
Number of PSUs   = 62    Population size = 103,505,700
                         Design df       =          31

Expression: Linear prediction, predict()

------------------------------------------------------------------------------
             |     Margin   Legend
-------------+----------------------------------------------------------------
      female |
       Male  |   111.1328   _b[0bn.female]
     Female  |   137.3958   _b[1.female]
------------------------------------------------------------------------------

lincom _b[1.female] - _b[0bn.female]

 ( 1)  - 0bn.female + 1.female = 0

------------------------------------------------------------------------------
             | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   26.26294   1.060099    24.77   0.000     24.10085    28.42502
------------------------------------------------------------------------------

We can run the same analysis using the svy: mean and lincom commands.
svy, over(female): mean copper (running mean on estimation sample) Survey: Mean estimation Number of strata = 31 Number of obs = 9,118 Number of PSUs = 62 Population size = 103,505,700 Design df = 31 ----------------------------------------------------------------- | Linearized | Mean std. err. [95% conf. interval] ----------------+------------------------------------------------ c.copper@female | Male | 111.1328 .5685644 109.9732 112.2924 Female | 137.3958 1.051281 135.2517 139.5399 ----------------------------------------------------------------- svy, over(female): mean copper, coeflegend (running mean on estimation sample) Survey: Mean estimation Number of strata = 31 Number of obs = 9,118 Number of PSUs = 62 Population size = 103,505,700 Design df = 31 --------------------------------------------------------------------------------- | Mean Legend ----------------+---------------------------------------------------------------- c.copper@female | Male | 111.1328 _b[c.copper@0bn.female] Female | 137.3958 _b[c.copper@1.female] --------------------------------------------------------------------------------- lincom _b[c.copper@1.female] - _b[c.copper@0bn.female] ( 1) - c.copper@0bn.female + c.copper@1.female = 0 ------------------------------------------------------------------------------ Mean | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- (1) | 26.26294 1.060099 24.77 0.000 24.10085 28.42502 ------------------------------------------------------------------------------
Let’s return to running linear regression models. We will add a continuous predictor to our OLS regression, so now the model has one categorical predictor (female) and one continuous predictor (age). In the output we see both the number of observations used in the analysis and the estimate of the size of the population. We see the overall test of the model and the R-squared for the model. 
The R-squared is the population R-squared and can be thought of as the adjusted R-squared. We will follow the svy: regress command with the margins command, which gives us the predicted values for each level of female.
svy: regress copper i.female age (running regress on estimation sample) Survey: Linear regression Number of strata = 31 Number of obs = 9,118 Number of PSUs = 62 Population size = 103,505,700 Design df = 31 F(2, 30) = 448.23 Prob > F = 0.0000 R-squared = 0.1497 ------------------------------------------------------------------------------ | Linearized copper | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- female | Female | 26.20177 1.087402 24.10 0.000 23.984 28.41954 age | .0635486 .0330805 1.92 0.064 -.0039196 .1310168 _cons | 108.4748 1.540579 70.41 0.000 105.3327 111.6168 ------------------------------------------------------------------------------ margins female, vce(unconditional) Predictive margins Number of strata = 31 Number of obs = 9,118 Number of PSUs = 62 Population size = 103,505,700 Design df = 31 Expression: Linear prediction, predict() ------------------------------------------------------------------------------ | Linearized | Margin std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- female | Male | 111.1645 .5718949 194.38 0.000 109.9981 112.3309 Female | 137.3663 1.062916 129.24 0.000 135.1984 139.5341 ------------------------------------------------------------------------------Now let’s run a model with a categorical by categorical (specifically a binary by binary) interaction term. We can use the contrast command to get the test of the interaction displayed as an F-test rather than a t-statistic. Of course, the p-value is exactly the same as the one shown in the table of coefficients because the test of a binary by binary interaction is a one degree-of-freedom test. We can use the margins command to get the predicted values of copper for each combination of female and diabetes. 
The marginsplot command is used to obtain a graph of the interaction (marginsplot graphs the values shown in the output from the margins command).
svy: regress copper i.female##i.diabetes age (running regress on estimation sample) Survey: Linear regression Number of strata = 31 Number of obs = 9,116 Number of PSUs = 62 Population size = 103,479,298 Design df = 31 F(4, 28) = 315.94 Prob > F = 0.0000 R-squared = 0.1505 ---------------------------------------------------------------------------------- | Linearized copper | Coefficient std. err. t P>|t| [95% conf. interval] -----------------+---------------------------------------------------------------- female | Female | 26.42748 1.091601 24.21 0.000 24.20114 28.65381 | diabetes | Diabetic | 6.350642 2.034884 3.12 0.004 2.200469 10.50082 | female#diabetes | Female#Diabetic | -6.493926 2.706082 -2.40 0.023 -12.01302 -.9748357 | age | .0573067 .0332424 1.72 0.095 -.0104917 .1251051 _cons | 108.5489 1.542311 70.38 0.000 105.4034 111.6945 ---------------------------------------------------------------------------------- contrast female#diabetes, effects Contrasts of marginal linear predictions Design df = 31 Margins: asbalanced --------------------------------------------------- | df F P>F ----------------+---------------------------------- female#diabetes | 1 5.76 0.0226 Design | 31 --------------------------------------------------- Note: F statistics are adjusted for the survey design. ------------------------------------------------------------------------------------------------------ | Contrast Std. err. t P>|t| [95% conf. 
interval] -------------------------------------+---------------------------------------------------------------- female#diabetes | (Female vs base) (Diabetic vs base) | -6.493926 2.706082 -2.40 0.023 -12.01302 -.9748357 ------------------------------------------------------------------------------------------------------ margins female#diabetes, vce(unconditional) Predictive margins Number of strata = 31 Number of obs = 9,116 Number of PSUs = 62 Population size = 103,479,298 Design df = 31 Expression: Linear prediction, predict() -------------------------------------------------------------------------------------- | Linearized | Margin std. err. t P>|t| [95% conf. interval] ---------------------+---------------------------------------------------------------- female#diabetes | Male#Not diabetic | 110.9745 .5531673 200.62 0.000 109.8463 112.1027 Male#Diabetic | 117.3252 2.201312 53.30 0.000 112.8356 121.8148 Female#Not diabetic | 137.402 1.077068 127.57 0.000 135.2053 139.5987 Female#Diabetic | 137.2587 2.417731 56.77 0.000 132.3277 142.1897 -------------------------------------------------------------------------------------
marginsplot Variables that uniquely identify margins: female diabetes
Some of you may recognize this analysis as a “difference-in-differences” analysis. In other words, the term of interest is the binary-by-binary interaction term. Let’s reproduce the coefficient of this interaction term, -6.493926, by taking the difference of the differences ((male_no – male_yes) – (female_no – female_yes)). First, we need to rerun the margins command with the coeflegend option and the post option; the post option is needed so that we can access the values remembered by margins with another command. The output will show us the names of the values. We will use these names rather than the actual values themselves, so that if the data or model were changed, the code would not need to be modified. We will use the lincom command to do the math for us. 
The advantage of using the lincom command over the display command is that lincom will also produce the standard error.
margins female#diabetes, vce(unconditional) Predictive margins Number of strata = 31 Number of obs = 9,116 Number of PSUs = 62 Population size = 103,479,298 Design df = 31 Expression: Linear prediction, predict() -------------------------------------------------------------------------------------- | Linearized | Margin std. err. t P>|t| [95% conf. interval] ---------------------+---------------------------------------------------------------- female#diabetes | Male#Not diabetic | 110.9745 .5531673 200.62 0.000 109.8463 112.1027 Male#Diabetic | 117.3252 2.201312 53.30 0.000 112.8356 121.8148 Female#Not diabetic | 137.402 1.077068 127.57 0.000 135.2053 139.5987 Female#Diabetic | 137.2587 2.417731 56.77 0.000 132.3277 142.1897 --------------------------------------------------------------------------------------margins, vce(unconditional) coeflegend post Predictive margins Number of strata = 31 Number of obs = 9,116 Number of PSUs = 62 Population size = 103,479,298 Design df = 31 Expression: Linear prediction, predict() -------------------------------------------------------------------------------------- | Linearized | Margin std. err. t P>|t| [95% conf. interval] ---------------------+---------------------------------------------------------------- female#diabetes | Male#Not diabetic | 110.9745 .5531673 200.62 0.000 109.8463 112.1027 Male#Diabetic | 117.3252 2.201312 53.30 0.000 112.8356 121.8148 Female#Not diabetic | 137.402 1.077068 127.57 0.000 135.2053 139.5987 Female#Diabetic | 137.2587 2.417731 56.77 0.000 132.3277 142.1897 --------------------------------------------------------------------------------------lincom (_b[0bn.female#0bn.diabetes]-_b[0bn.female#1.diabetes]) - (_b[1.female#0bn.diabetes]-_b[1.female#1.diabetes]) ( 1) 0bn.female#0bn.diabetes - 0bn.female#1.diabetes - 1.female#0bn.diabetes + 1.female#1.diabetes = 0 ------------------------------------------------------------------------------ | Coefficient Std. err. t P>|t| [95% conf. 
interval] -------------+---------------------------------------------------------------- (1) | -6.493926 2.706082 -2.40 0.023 -12.01302 -.9748357 ------------------------------------------------------------------------------We can see that the output from the lincom command is the same as the output for the interaction term in the original regression command.
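As a quick check of the arithmetic, the difference of the differences can also be computed directly from the predicted values shown above with the display command. This is just a verification sketch; lincom remains preferable because it also produces a standard error.

```stata
* Difference of the differences, using the predicted values from margins:
* (male_no - male_yes) - (female_no - female_yes)
display (110.9745 - 117.3252) - (137.402 - 137.2587)
```

This displays -6.494, which matches the interaction coefficient of -6.493926 up to the rounding of the displayed margins.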
Looking back at the graph of the interaction, you may wonder if the two points for male are statistically significantly different from one another, and the same for the two points for females. The lincom command can be used for these comparisons.
* males lincom (_b[0bn.female#0bn.diabetes]-_b[0bn.female#1.diabetes]) ( 1) 0bn.female#0bn.diabetes - 0bn.female#1.diabetes = 0 ------------------------------------------------------------------------------ | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- (1) | -6.350642 2.034884 -3.12 0.004 -10.50082 -2.200469 ------------------------------------------------------------------------------ * females lincom (_b[1.female#0bn.diabetes]-_b[1.female#1.diabetes]) ( 1) 1.female#0bn.diabetes - 1.female#1.diabetes = 0 ------------------------------------------------------------------------------ | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- (1) | .1432847 2.42126 0.06 0.953 -4.794907 5.081476 ------------------------------------------------------------------------------In the next example, we will use a categorical by continuous interaction. In the first call to the margins command, we get the simple slope coefficients for males and females. The p-values in this table tell us that both slopes are significantly different from 0. In the second call to the margins command, we get the predicted values for each level of female at height = 140, height = 150, height = 160, up to height = 190. The t-statistics and corresponding p-values tell us if the predicted value is different from 0.
svy: regress bpsystol i.female##c.height (running regress on estimation sample) Survey: Linear regression Number of strata = 31 Number of obs = 10,337 Number of PSUs = 62 Population size = 117,023,659 Design df = 31 F(3, 29) = 71.21 Prob > F = 0.0000 R-squared = 0.0309 --------------------------------------------------------------------------------- | Linearized bpsystol | Coefficient std. err. t P>|t| [95% conf. interval] ----------------+---------------------------------------------------------------- female | Female | 43.61115 13.30799 3.28 0.003 16.46933 70.75298 height | -.1647344 .0518262 -3.18 0.003 -.2704347 -.0590342 | female#c.height | Female | -.3185637 .0785976 -4.05 0.000 -.4788646 -.1582627 | _cons | 158.8447 9.148233 17.36 0.000 140.1868 177.5026 --------------------------------------------------------------------------------- margins female, dydx(height) vce(unconditional) Average marginal effects Number of strata = 31 Number of obs = 10,337 Number of PSUs = 62 Population size = 117,023,659 Design df = 31 Expression: Linear prediction, predict() dy/dx wrt: height ------------------------------------------------------------------------------ | Linearized | dy/dx std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- height | female | Male | -.1647344 .0518262 -3.18 0.003 -.2704347 -.0590342 Female | -.4832981 .0550113 -8.79 0.000 -.5954944 -.3711018 ------------------------------------------------------------------------------ margins female, at(height=(140(10)190)) vsquish vce(unconditional) Adjusted predictions Number of strata = 31 Number of obs = 10,337 Number of PSUs = 62 Population size = 117,023,659 Design df = 31 Expression: Linear prediction, predict() 1._at: height = 140 2._at: height = 150 3._at: height = 160 4._at: height = 170 5._at: height = 180 6._at: height = 190 ------------------------------------------------------------------------------ | Linearized | Margin std. err. 
t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- _at#female | 1#Male | 135.7819 1.978535 68.63 0.000 131.7466 139.8171 1#Female | 134.7941 1.501944 89.75 0.000 131.7309 137.8574 2#Male | 134.1345 1.498425 89.52 0.000 131.0785 137.1906 2#Female | 129.9611 1.046223 124.22 0.000 127.8274 132.0949 3#Male | 132.4872 1.055057 125.57 0.000 130.3354 134.639 3#Female | 125.1282 .7338792 170.50 0.000 123.6314 126.6249 4#Male | 130.8399 .7198653 181.76 0.000 129.3717 132.308 4#Female | 120.2952 .7666958 156.90 0.000 118.7315 121.8589 5#Male | 129.1925 .6785699 190.39 0.000 127.8086 130.5765 5#Female | 115.4622 1.114592 103.59 0.000 113.189 117.7354 6#Male | 127.5452 .9694841 131.56 0.000 125.5679 129.5224 6#Female | 110.6292 1.581789 69.94 0.000 107.4031 113.8553 ------------------------------------------------------------------------------ marginsplot Variables that uniquely identify margins: height femaleIn the example below, we get the difference between the predicted values for males and females at six values of height. We then use the marginsplot command to graph these differences. This graph shows that the difference between males and females is larger at the extreme values of height, as is the variability around those estimates.
Notice that in the output from the margins command above, the predicted value for 4#female is 120.2952 and the predicted value for 4#male is 130.8399. 120.2952-130.8399 = -10.5447, which is the value of dy/dx on line 4 of the output below.
margins, dydx(female) at(height=(140(10)190)) vsquish vce(unconditional) Conditional marginal effects Number of strata = 31 Number of obs = 10,337 Number of PSUs = 62 Population size = 117,023,659 Design df = 31 Expression: Linear prediction, predict() dy/dx wrt: 1.female 1._at: height = 140 2._at: height = 150 3._at: height = 160 4._at: height = 170 5._at: height = 180 6._at: height = 190 ------------------------------------------------------------------------------ | Linearized | dy/dx std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- 0.female | (base outcome) -------------+---------------------------------------------------------------- 1.female | _at | 1 | -.9877624 2.392115 -0.41 0.683 -5.866514 3.890989 2 | -4.173399 1.657451 -2.52 0.017 -7.553793 -.7930051 3 | -7.359036 1.003788 -7.33 0.000 -9.406275 -5.311796 4 | -10.54467 .7096161 -14.86 0.000 -11.99194 -9.097401 5 | -13.73031 1.111322 -12.35 0.000 -15.99687 -11.46375 6 | -16.91595 1.789424 -9.45 0.000 -20.5655 -13.26639 ------------------------------------------------------------------------------ Note: dy/dx for factor levels is the discrete change from the base level.marginsplot, yline(0) Variables that uniquely identify margins: heightLet’s remake this graph with confidence interval as an area plot.
marginsplot, yline(0) recast(line) recastci(rarea) ciopts(color(*.2)) Variables that uniquely identify margins: height
If we use a categorical predictor variable in our model that has more than two levels, we may want to run comparisons between each of the levels. In the example below, we use the variable region as a predictor. We use the contrast command to determine if region, taken as a whole, is a statistically significant predictor of our outcome variable, copper. Next, we use the pwcompare command to conduct all pairwise comparisons. We use the mcompare(sidak) option to adjust the p-values for the multiple comparisons, and we use the pveffects option to have the test statistics and p-values shown in the output. Other options that we could have used with the mcompare option are bonferroni and scheffe.
svy: regress copper i.region age (running regress on estimation sample) Survey: Linear regression Number of strata = 31 Number of obs = 9,118 Number of PSUs = 62 Population size = 103,505,700 Design df = 31 F(4, 28) = 3.93 Prob > F = 0.0118 R-squared = 0.0054 ------------------------------------------------------------------------------ | Linearized copper | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- region | MW | 1.222524 2.098883 0.58 0.564 -3.058175 5.503223 S | 5.527657 1.97134 2.80 0.009 1.507083 9.548231 W | 1.951666 2.093714 0.93 0.358 -2.318492 6.221824 | age | .0878814 .0359642 2.44 0.020 .0145319 .1612308 _cons | 118.706 2.487797 47.72 0.000 113.6322 123.7799 ------------------------------------------------------------------------------ contrast region Contrasts of marginal linear predictions Design df = 31 Margins: asbalanced ------------------------------------------------ | df F P>F -------------+---------------------------------- region | 3 3.23 0.0368 Design | 31 ------------------------------------------------ Note: F statistics are adjusted for the survey design. pwcompare region, mcompare(sidak) cformat(%3.1f) pveffects Pairwise comparisons of marginal linear predictions Design df = 31 Margins: asbalanced --------------------------- | Number of | comparisons -------------+------------- region | 6 --------------------------- ----------------------------------------------------- | Sidak | Contrast Std. err. t P>|t| -------------+--------------------------------------- region | MW vs NE | 1.2 2.1 0.58 0.993 S vs NE | 5.5 2.0 2.80 0.051 W vs NE | 2.0 2.1 0.93 0.930 S vs MW | 4.3 1.8 2.42 0.123 W vs MW | 0.7 1.9 0.38 0.999 W vs S | -3.6 1.8 -2.02 0.277 -----------------------------------------------------
Logistic regression
Let’s run a logistic regression. We will use the variable highbp (Do you have high blood pressure?) as the outcome variable. Before we run our logistic regression, let’s look at the distribution of the outcome variable, and let’s look at the crosstabulation of the outcome variable with our categorical predictor variable. The purpose of the first descriptive analysis is to assess the proportion and number of cases in each level of the outcome variable. If we found that there was a very small proportion of cases in one of the levels, we may encounter estimation problems with the logistic regression. Likewise, if the crosstabulation between the outcome variable and the categorical predictor variable showed that one or more cells had very few unweighted observations or no observations at all, we might want to recode the predictor variable into fewer categories so that there were a sufficient number of observations in each cell. The results below do not look problematic, although there are relatively few observations that are coded as 0 on our outcome variable and 1 on our predictor variable. This might increase the standard error around that point estimate.
svy: tab highbp, cell obs count cellwidth(12) format(%12.2g) (running tabulate on estimation sample) Number of strata = 31 Number of obs = 10,337 Number of PSUs = 62 Population size = 117,023,659 Design df = 31 ---------------------------------------------------- High | blood | pressure | count proportion obs ----------+----------------------------------------- 0 | 73871969 .63 5965 1 | 43151690 .37 4372 | Total | 117023659 1 10337 ---------------------------------------------------- Key: count = Weighted count proportion = Cell proportion obs = Number of observations svy: tab health highbp, cell obs count cellwidth(12) format(%12.2g) (running tabulate on estimation sample) Number of strata = 31 Number of obs = 10,335 Number of PSUs = 62 Population size = 116,997,257 Design df = 31 ---------------------------------------------------- 1=poor,.. | ., | 5=excelle | High blood pressure nt | 0 1 Total ----------+----------------------------------------- Poor | 2550049 2987907 5537956 | .022 .026 .047 | 310 419 729 | Fair | 7370803 7009458 14380261 | .063 .06 .12 | 766 904 1670 | Average | 19137093 13578302 32715395 | .16 .12 .28 | 1572 1366 2938 | Good | 21856626 10319684 32176310 | .19 .088 .28 | 1666 925 2591 | Excellen | 22930996 9256339 32187335 | .2 .079 .28 | 1649 758 2407 | Total | 73845567 43151690 116997257 | .63 .37 1 | 5963 4372 10335 ---------------------------------------------------- Key: Weighted count Cell proportion Number of observations Pearson: Uncorrected chi2(4) = 273.4001 Design-based F(3.53, 109.46) = 49.8959 P = 0.0000Now let’s run the logistic regression. We can use the contrast command to get the multi-degree-of-freedom test of health (which is statistically significant). Note that the logistic regression coefficients are interpreted the same way that they are in an unweighted model.
svy: logit highbp ib3.health c.age (running logit on estimation sample) Survey: Logistic regression Number of strata = 31 Number of obs = 10,335 Number of PSUs = 62 Population size = 116,997,257 Design df = 31 F(5, 27) = 186.86 Prob > F = 0.0000 ------------------------------------------------------------------------------ | Linearized highbp | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- health | Poor | .0422165 .1317361 0.32 0.751 -.2264609 .310894 Fair | .0015363 .0676718 0.02 0.982 -.1364811 .1395538 Good | -.1926015 .0614354 -3.14 0.004 -.3178998 -.0673033 Excellent | -.2633746 .0769005 -3.42 0.002 -.4202142 -.106535 | age | .0473091 .0015249 31.03 0.000 .0441992 .0504191 _cons | -2.480283 .1060894 -23.38 0.000 -2.696654 -2.263913 ------------------------------------------------------------------------------ contrast health Contrasts of marginal linear predictions Design df = 31 Margins: asbalanced ------------------------------------------------ | df F P>F -------------+---------------------------------- health | 4 3.85 0.0129 Design | 31 ------------------------------------------------ Note: F statistics are adjusted for the survey design.Let’s include a categorical by continuous interaction in the model. We will use the contrast command to determine if the interaction term (as a whole) is statistically significant. We will use the contrast command to get the multi-degree-of-freedom test of the interaction, the margins command to get the predicted probabilities and the marginsplot command to graph the interaction.
svy: logit highbp ib3.health##c.age (running logit on estimation sample) Survey: Logistic regression Number of strata = 31 Number of obs = 10,335 Number of PSUs = 62 Population size = 116,997,257 Design df = 31 F(9, 23) = 86.76 Prob > F = 0.0000 ------------------------------------------------------------------------------ | Linearized highbp | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- health | Poor | 1.184622 .4603688 2.57 0.015 .2456932 2.12355 Fair | .0943241 .2754073 0.34 0.734 -.4673728 .656021 Good | -.3275786 .1702732 -1.92 0.064 -.6748529 .0196958 Excellent | -.4030749 .1921477 -2.10 0.044 -.7949627 -.011187 | age | .0467356 .0025779 18.13 0.000 .0414779 .0519933 | health#c.age | Poor | -.0206937 .0078239 -2.64 0.013 -.0366506 -.0047367 Fair | -.0017234 .0052911 -0.33 0.747 -.0125146 .0090678 Good | .0031633 .0035254 0.90 0.376 -.0040267 .0103534 Excellent | .0033972 .0040242 0.84 0.405 -.0048104 .0116047 | _cons | -2.453943 .1458084 -16.83 0.000 -2.751322 -2.156565 ------------------------------------------------------------------------------ contrast health#c.age Contrasts of marginal linear predictions Design df = 31 Margins: asbalanced ------------------------------------------------ | df F P>F -------------+---------------------------------- health#c.age | 4 2.54 0.0617 Design | 31 ------------------------------------------------ Note: F statistics are adjusted for the survey design.margins health, at(age=(20(10)70)) vsquish vce(unconditional) Adjusted predictions Number of strata = 31 Number of obs = 10,335 Number of PSUs = 62 Population size = 116,997,257 Design df = 31 Expression: Pr(highbp), predict() 1._at: age = 20 2._at: age = 30 3._at: age = 40 4._at: age = 50 5._at: age = 60 6._at: age = 70 ------------------------------------------------------------------------------ | Linearized | Margin std. err. t P>|t| [95% conf. 
interval] -------------+---------------------------------------------------------------- _at#health | 1#Poor | .321152 .0676631 4.75 0.000 .1831522 .4591517 1#Fair | .188563 .0273867 6.89 0.000 .1327074 .2444185 1#Average | .1795748 .0154403 11.63 0.000 .1480841 .2110655 1#Good | .1438662 .0112541 12.78 0.000 .1209134 .1668191 1#Excellent | .1353615 .0128287 10.55 0.000 .1091971 .1615258 2#Poor | .3803499 .0579559 6.56 0.000 .2621481 .4985517 2#Fair | .2671262 .0280849 9.51 0.000 .2098467 .3244057 2#Average | .2588653 .0169863 15.24 0.000 .2242214 .2935091 2#Good | .2167762 .0129495 16.74 0.000 .1903655 .2431868 2#Excellent | .2053746 .0157743 13.02 0.000 .1732028 .2375465 3#Poor | .4433335 .0459814 9.64 0.000 .3495538 .5371132 3#Fair | .3637487 .0263469 13.81 0.000 .3100139 .4174836 3#Average | .3578936 .0178355 20.07 0.000 .3215178 .3942693 3#Good | .313122 .0151767 20.63 0.000 .2821689 .3440752 3#Excellent | .2990749 .0197526 15.14 0.000 .2587891 .3393607 4#Poor | .5081932 .0345058 14.73 0.000 .4378182 .5785682 4#Fair | .472775 .0238764 19.80 0.000 .4240788 .5214711 4#Average | .4707429 .0185394 25.39 0.000 .4329315 .5085543 4#Good | .4288428 .0187103 22.92 0.000 .3906828 .4670028 4#Excellent | .4132904 .0249908 16.54 0.000 .3623213 .4642596 5#Poor | .5727782 .0284395 20.14 0.000 .5147755 .6307808 5#Fair | .5844612 .0233301 25.05 0.000 .5368791 .6320432 5#Average | .5866646 .0193689 30.29 0.000 .5471616 .6261677 5#Good | .5529051 .0226358 24.43 0.000 .5067391 .5990711 5#Excellent | .5376656 .0300081 17.92 0.000 .4764636 .5988677 6#Poor | .6349738 .0311436 20.39 0.000 .571456 .6984917 6#Fair | .6880968 .0245797 27.99 0.000 .6379662 .7382274 6#Average | .6937161 .019649 35.31 0.000 .6536417 .7337905 6#Good | .6707116 .0248188 27.02 0.000 .6200934 .7213298 6#Excellent | .6575232 .032423 20.28 0.000 .5913962 .7236503 ------------------------------------------------------------------------------ marginsplot Variables that uniquely identify margins: age healthNext, we 
will run a logistic regression model and use the estat gof command to get the Archer-Lemeshow test (which is a modification of the Hosmer-Lemeshow test that can be used with survey data). The numerator degrees of freedom are calculated as g – 1, and the denominator degrees of freedom are calculated as f – g + 2, where f = the number of sampled clusters – the number of strata and g = the number of groups (10 is the default). For the following example, we have 10 – 1 = 9 numerator degrees of freedom and 31 – 10 + 2 = 23 denominator degrees of freedom. In the output below, we see that the Archer-Lemeshow test is not statistically significant. This suggests that the model fit is acceptable. As of Stata 17, estat gof cannot follow a model specified with a subpopulation.
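The degrees-of-freedom arithmetic for this example can be checked directly in Stata. The numbers below come from the design summary in the output (62 PSUs and 31 strata) and the default of g = 10 groups.

```stata
* f = number of sampled clusters - number of strata = 62 - 31 = 31
display 10 - 1        // numerator df: g - 1, displays 9
display 31 - 10 + 2   // denominator df: f - g + 2, displays 23
```

These match the F(9,23) reported by estat gof below.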
svy: logit highbp i.female##c.age (running logit on estimation sample) Number of strata = 31 Number of obs = 10,337 Number of PSUs = 62 Population size = 117,023,659 Design df = 31 F(3, 29) = 335.03 Prob > F = 0.0000 ------------------------------------------------------------------------------ | Linearized highbp | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- female | Female | -1.922498 .1940278 -9.91 0.000 -2.318221 -1.526776 age | .03707 .0023869 15.53 0.000 .0322019 .0419381 | female#c.age | Female | .0294389 .0040073 7.35 0.000 .021266 .0376118 | _cons | -1.863043 .1414675 -13.17 0.000 -2.151567 -1.574518 ------------------------------------------------------------------------------ estat gof Logistic model for highbp, goodness-of-fit test F(9,23) = 2.06 Prob > F = 0.0784
Ordinal Logistic Regression
Here is an example of an ordinal logistic regression. The outcome variable, health, has ordered categories, but it is not a continuous variable. In the first call, we get the coefficients in the log-odds metric. In the second call, we request odds ratios. The discussion of ordinal logistic regression will be minimal; for more information regarding ordinal logistic regression, please see our Stata Data Analysis Examples: Ordinal Logistic Regression and Stata Annotated Output: Ordinal Logistic Regression. Although these pages show examples that use non-weighted data, they are still helpful because the interpretation of the coefficients is the same with weighted and non-weighted data.
svy: ologit health i.female i.region age
(running ologit on estimation sample)

Survey: Ordered logistic regression

Number of strata =  31                   Number of obs   =      10,335
Number of PSUs   =  62                   Population size = 116,997,257
                                         Design df       =          31
                                         F(5, 27)        =      135.54
                                         Prob > F        =      0.0000

------------------------------------------------------------------------------
             |             Linearized
      health | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |
     Female  |  -.1696374   .0493449    -3.44   0.002    -.2702769   -.0689979
             |
      region |
         MW  |   -.115146   .1030587    -1.12   0.272    -.3253356    .0950437
          S  |  -.5365581   .1050491    -5.11   0.000    -.7508071   -.3223091
          W  |  -.2887877   .1019162    -2.83   0.008    -.4966471   -.0809283
             |
         age |  -.0397158   .0014944   -26.58   0.000    -.0427636   -.0366681
-------------+----------------------------------------------------------------
       /cut1 |  -5.210696    .122255                     -5.460036   -4.961355
       /cut2 |  -3.717526   .1126799                     -3.947339   -3.487714
       /cut3 |  -2.202685   .1097316                     -2.426484   -1.978886
       /cut4 |  -.9350942   .1089012                       -1.1572   -.7129888
------------------------------------------------------------------------------

contrast region

Contrasts of marginal linear predictions        Design df = 31

Margins: asbalanced

------------------------------------------------
             |         df           F        P>F
-------------+----------------------------------
health       |
      region |          3        8.93     0.0002
             |
      Design |         31
------------------------------------------------
Note: F statistics are adjusted for the survey design.

ologit, or

Survey: Ordered logistic regression

Number of strata =  31                   Number of obs   =      10,335
Number of PSUs   =  62                   Population size = 116,997,257
                                         Design df       =          31
                                         F(5, 27)        =      135.54
                                         Prob > F        =      0.0000

------------------------------------------------------------------------------
             |             Linearized
      health | Odds ratio   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |
     Female  |   .8439708   .0416456    -3.44   0.002     .7631682    .9333287
             |
      region |
         MW  |    .891236   .0918497    -1.12   0.272     .7222849    1.099707
          S  |   .5847575   .0614282    -5.11   0.000     .4719855    .7244742
          W  |   .7491713   .0763527    -2.83   0.008     .6085677    .9222598
             |
         age |   .9610625   .0014362   -26.58   0.000     .9581378    .9639961
-------------+----------------------------------------------------------------
       /cut1 |  -5.210696    .122255                     -5.460036   -4.961355
       /cut2 |  -3.717526   .1126799                     -3.947339   -3.487714
       /cut3 |  -2.202685   .1097316                     -2.426484   -1.978886
       /cut4 |  -.9350942   .1089012                       -1.1572   -.7129888
------------------------------------------------------------------------------
Note: Estimates are transformed only in the first equation to odds ratios.

One of the important assumptions in ordinal logistic regression is the proportional odds assumption, also called the parallel regressions assumption: the effect of each predictor is assumed to be the same across the different thresholds of the outcome. As of Stata 17, there is no post-estimation command to test this assumption after svy: ologit. However, following the advice in Heeringa, West and Berglund (2017, pages 316-317), the community-contributed command gologit2 can be used along with the test command to assess this assumption. Although the gologit2 command does respect factor notation, the individual dummy variables for the categorical predictor are needed to use the test command. We can use the tab command with the gen option to create the dummy variables.
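Typing the full set of pairwise tests by hand is tedious. A minimal sketch of how the same tests could be scripted, assuming the equation names (Poor, Fair, Average, Good) and the predictor names match those in the gologit2 output:

```stata
* Sketch: loop over the predictors and the non-reference equations,
* testing whether each coefficient is equal across thresholds.
foreach v in female regn2 regn3 regn4 age {
    foreach eq in Fair Average Good {
        test [Poor = `eq']: `v'
    }
}
```

Each iteration reproduces one of the adjusted Wald tests shown in the output that follows.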
tab region, gen(regn)

     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
         NE |      2,086       20.18       20.18
         MW |      2,773       26.83       47.01
          S |      2,853       27.60       74.61
          W |      2,625       25.39      100.00
------------+-----------------------------------
      Total |     10,337      100.00

svy: gologit2 health female regn2 regn3 regn4 age
(running gologit2 on estimation sample)

Survey: Generalized Ordered Logit Estimates

Number of strata =  31                   Number of obs   =      10,335
Number of PSUs   =  62                   Population size = 116,997,257
                                         Design df       =          31
                                         F(20, 12)       =       29.90
                                         Prob > F        =      0.0000

------------------------------------------------------------------------------
             |             Linearized
      health | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
Poor         |
      female |   .1499993   .0959061     1.56   0.128    -.0456025     .345601
       regn2 |  -.4132141   .2415216    -1.71   0.097    -.9058006    .0793724
       regn3 |  -1.118699   .2313752    -4.83   0.000    -1.590592   -.6468062
       regn4 |  -.5114765   .2109836    -2.42   0.021    -.9417804   -.0811725
         age |  -.0569788   .0028148   -20.24   0.000    -.0627196   -.0512379
       _cons |   6.297308   .2850896    22.09   0.000     5.715864    6.878752
-------------+----------------------------------------------------------------
Fair         |
      female |  -.1703883     .06255    -2.72   0.010    -.2979599   -.0428167
       regn2 |  -.3785902   .1740228    -2.18   0.037    -.7335121   -.0236683
       regn3 |   -.813866   .1432468    -5.68   0.000     -1.10602   -.5217122
       regn4 |  -.5081594   .1359228    -3.74   0.001    -.7853757    -.230943
         age |  -.0508352   .0019272   -26.38   0.000    -.0547657   -.0469048
       _cons |   4.497852   .1656741    27.15   0.000     4.159957    4.835747
-------------+----------------------------------------------------------------
Average      |
      female |  -.1590126   .0566411    -2.81   0.009    -.2745329   -.0434923
       regn2 |  -.0846411   .1105681    -0.77   0.450    -.3101462    .1408639
       regn3 |  -.4904056   .1009544    -4.86   0.000    -.6963035   -.2845077
       regn4 |  -.2537649   .1035809    -2.45   0.020    -.4650196   -.0425101
         age |  -.0391082   .0015812   -24.73   0.000    -.0423331   -.0358833
       _cons |   2.172569   .1155751    18.80   0.000     1.936852    2.408286
-------------+----------------------------------------------------------------
Good         |
      female |  -.2187799   .0608785    -3.59   0.001    -.3429423   -.0946174
       regn2 |  -.0127317    .102245    -0.12   0.902    -.2212617    .1957983
       regn3 |  -.3856338   .1162264    -3.32   0.002     -.622679   -.1485885
       regn4 |  -.2200033   .1160418    -1.90   0.067    -.4566721    .0166655
         age |  -.0306087   .0017022   -17.98   0.000    -.0340802   -.0271371
       _cons |   .5442875   .1205459     4.52   0.000     .2984326    .7901425
------------------------------------------------------------------------------

test [Poor = Fair]: female

Adjusted Wald test

 ( 1)  [Poor]female - [Fair]female = 0

       F(  1,    31) =   17.87
            Prob > F =    0.0002

test [Poor = Average]: female

Adjusted Wald test

 ( 1)  [Poor]female - [Average]female = 0

       F(  1,    31) =    9.56
            Prob > F =    0.0042

test [Poor = Good]: female

Adjusted Wald test

 ( 1)  [Poor]female - [Good]female = 0

       F(  1,    31) =   14.07
            Prob > F =    0.0007

test [Poor = Fair]: regn2

Adjusted Wald test

 ( 1)  [Poor]regn2 - [Fair]regn2 = 0

       F(  1,    31) =    0.08
            Prob > F =    0.7821

test [Poor = Average]: regn2

Adjusted Wald test

 ( 1)  [Poor]regn2 - [Average]regn2 = 0

       F(  1,    31) =    2.49
            Prob > F =    0.1245

test [Poor = Good]: regn2

Adjusted Wald test

 ( 1)  [Poor]regn2 - [Good]regn2 = 0

       F(  1,    31) =    2.87
            Prob > F =    0.1003

test [Poor = Fair]: regn3

Adjusted Wald test

 ( 1)  [Poor]regn3 - [Fair]regn3 = 0

       F(  1,    31) =    6.27
            Prob > F =    0.0177

test [Poor = Average]: regn3

Adjusted Wald test

 ( 1)  [Poor]regn3 - [Average]regn3 = 0

       F(  1,    31) =    8.85
            Prob > F =    0.0056

test [Poor = Good]: regn3

Adjusted Wald test

 ( 1)  [Poor]regn3 - [Good]regn3 = 0

       F(  1,    31) =   10.19
            Prob > F =    0.0032

test [Poor = Fair]: regn4

Adjusted Wald test

 ( 1)  [Poor]regn4 - [Fair]regn4 = 0

       F(  1,    31) =    0.00
            Prob > F =    0.9761

test [Poor = Average]: regn4

Adjusted Wald test

 ( 1)  [Poor]regn4 - [Average]regn4 = 0

       F(  1,    31) =    1.73
            Prob > F =    0.1987

test [Poor = Good]: regn4

Adjusted Wald test

 ( 1)  [Poor]regn4 - [Good]regn4 = 0

       F(  1,    31) =    2.26
            Prob > F =    0.1432

test [Poor = Fair]: age

Adjusted Wald test

 ( 1)  [Poor]age - [Fair]age = 0

       F(  1,    31) =    6.92
            Prob > F =    0.0132

test [Poor = Average]: age

Adjusted Wald test

 ( 1)  [Poor]age - [Average]age = 0

       F(  1,    31) =   50.54
            Prob > F =    0.0000

test [Poor = Good]: age

Adjusted Wald test

 ( 1)  [Poor]age - [Good]age = 0

       F(  1,    31) =   88.85
            Prob > F =    0.0000

Multinomial Logistic Regression
The outcome variable in a multinomial logistic regression has unordered categories. One of these categories must be selected as the base, or reference, category. A series of simultaneously estimated binary logistic regressions is run, in which each category is compared to the base category. In the first call, we get regression coefficients in the log-odds metric. In the second call, we request relative risk ratios with the rrr option. For more information about multinomial logistic regression, please see Stata Data Analysis Examples: Multinomial Logistic Regression and Stata Annotated Output: Multinomial Logistic Regression. Although these pages show examples that use non-weighted data, they are still helpful because the interpretation of the coefficients is the same with weighted and non-weighted data.
Before we run the multinomial logit model, it is a good idea to check the raw (i.e., unweighted) number of observations in each of the cells created by the outcome variable and the two categorical predictors. Several different commands can be used to get this information; because there are three variables involved, the table command is convenient. This command allows for the use of a pweight, but because we are only interested in the raw counts, no weight is specified here.
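For reference, a hedged sketch of how the weighted version would be specified; the weight variable name wtvar is hypothetical and would need to be replaced with the actual sampling weight in your data set:

```stata
* Hypothetical: population-weighted cell counts rather than raw counts.
* wtvar stands in for whatever sampling weight variable the data set uses.
table region female race [pweight=wtvar]
```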
table region female race

----------------------------------------------
             |              Race
             |  White   Black   Other    Total
-------------+--------------------------------
Region       |
  NE         |
    Female   |
      Male   |    957      51       5    1,013
      Female |  1,012      55       6    1,073
      Total  |  1,969     106      11    2,086
  MW         |
    Female   |
      Male   |  1,170     133       7    1,310
      Female |  1,291     162      10    1,463
      Total  |  2,461     295      17    2,773
  S          |
    Female   |
      Male   |  1,076     247       9    1,332
      Female |  1,208     301      12    1,521
      Total  |  2,284     548      21    2,853
  W          |
    Female   |
      Male   |  1,103      69      82    1,254
      Female |  1,234      68      69    1,371
      Total  |  2,337     137     151    2,625
  Total      |
    Female   |
      Male   |  4,306     500     103    4,909
      Female |  4,745     586      97    5,428
      Total  |  9,051   1,086     200   10,337
----------------------------------------------

svy: mlogit region i.female i.race age
(running mlogit on estimation sample)

Survey: Multinomial logistic regression

Number of strata =  31                   Number of obs   =      10,337
Number of PSUs   =  62                   Population size = 117,023,659
                                         Design df       =          31
                                         F(12, 20)       =        4.01
                                         Prob > F        =      0.0031

------------------------------------------------------------------------------
             |             Linearized
      region | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
NE           |
      female |
     Female  |  -.0489935   .0575578    -0.85   0.401    -.1663833    .0683964
             |
        race |
      Black  |  -.2046099   .5385896    -0.38   0.707    -1.303071    .8938508
      Other  |  -2.492087    .675622    -3.69   0.001    -3.870027   -1.114146
             |
         age |   .0063994    .003171     2.02   0.052    -.0000679    .0128666
       _cons |  -.4682596   .1301252    -3.60   0.001    -.7336518   -.2028675
-------------+----------------------------------------------------------------
MW           |
      female |
     Female  |   .0037917   .0677251     0.06   0.956    -.1343345     .141918
             |
        race |
      Black  |    .592007   .5207979     1.14   0.264    -.4701674    1.654181
      Other  |  -2.153523   .6432367    -3.35   0.002    -3.465413    -.841633
             |
         age |   .0003402   .0033946     0.10   0.921    -.0065831    .0072635
       _cons |   -.107809   .1354847    -0.80   0.432    -.3841319    .1685138
-------------+----------------------------------------------------------------
S            |
      female |
     Female  |   .0212768    .067619     0.31   0.755    -.1166331    .1591868
             |
        race |
      Black  |   1.287368   .5454525     2.36   0.025     .1749105    2.399826
      Other  |  -2.279442   .6478685    -3.52   0.001    -3.600779   -.9581059
             |
         age |   .0070144   .0037278     1.88   0.069    -.0005885    .0146173
       _cons |  -.4274467   .1356136    -3.15   0.004    -.7040326   -.1508609
-------------+----------------------------------------------------------------
W            |  (base outcome)
------------------------------------------------------------------------------

mlogit, rrr

Survey: Multinomial logistic regression

Number of strata =  31                   Number of obs   =      10,337
Number of PSUs   =  62                   Population size = 117,023,659
                                         Design df       =          31
                                         F(12, 20)       =        4.01
                                         Prob > F        =      0.0031

------------------------------------------------------------------------------
             |             Linearized
      region |        RRR   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
NE           |
      female |
     Female  |   .9521874   .0548058    -0.85   0.401     .8467216     1.07079
             |
        race |
      Black  |   .8149652   .4389318    -0.38   0.707     .2716962    2.444525
      Other  |   .0827372    .055899    -3.69   0.001     .0208578    .3281953
             |
         age |    1.00642   .0031913     2.02   0.052     .9999321     1.01295
       _cons |   .6260909   .0814702    -3.60   0.001     .4801524    .8163864
-------------+----------------------------------------------------------------
MW           |
      female |
     Female  |   1.003799   .0679824     0.06   0.956     .8742976    1.152482
             |
        race |
      Black  |   1.807613   .9414009     1.14   0.264     .6248976    5.228798
      Other  |   .1160745   .0746634    -3.35   0.002     .0312601    .4310061
             |
         age |    1.00034   .0033957     0.10   0.921     .9934386     1.00729
       _cons |    .897799    .121638    -0.80   0.432     .6810416    1.183545
-------------+----------------------------------------------------------------
S            |
      female |
     Female  |   1.021505   .0690732     0.31   0.755     .8899116    1.172557
             |
        race |
      Black  |   3.623238   1.976304     2.36   0.025      1.19114    11.02126
      Other  |   .1023413   .0663037    -3.52   0.001     .0273024    .3836188
             |
         age |   1.007039   .0037541     1.88   0.069     .9994116    1.014725
       _cons |   .6521721   .0884434    -3.15   0.004     .4945868    .8599673
-------------+----------------------------------------------------------------
W            |  (base outcome)
------------------------------------------------------------------------------
Note: _cons estimates baseline relative risk for each outcome.

In the next example, we will change the base category to “North East”, which has a numerical value of 1.
svy: mlogit region i.female i.race, base(1)
(running mlogit on estimation sample)

Survey: Multinomial logistic regression

Number of strata =  31                   Number of obs   =      10,337
Number of PSUs   =  62                   Population size = 117,023,659
                                         Design df       =          31
                                         F(9, 23)        =        3.84
                                         Prob > F        =      0.0044

------------------------------------------------------------------------------
             |             Linearized
      region | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
NE           |  (base outcome)
-------------+----------------------------------------------------------------
MW           |
      female |
     Female  |   .0485305   .0590299     0.82   0.417    -.0718618    .1689228
             |
        race |
      Black  |    .811901   .3223367     2.52   0.017      .154491    1.469311
      Other  |   .3474752   .4970103     0.70   0.490    -.6661839    1.361134
             |
       _cons |   .1049302    .053029     1.98   0.057    -.0032232    .2130836
-------------+----------------------------------------------------------------
S            |
      female |
     Female  |    .070287   .0559074     1.26   0.218    -.0437369    .1843109
             |
        race |
      Black  |   1.490358   .3640586     4.09   0.000     .7478561    2.232861
      Other  |    .211622   .4869092     0.43   0.667    -.7814358     1.20468
             |
       _cons |   .0674677   .0940191     0.72   0.478    -.1242856     .259221
-------------+----------------------------------------------------------------
W            |
      female |
     Female  |    .044361   .0586397     0.76   0.455    -.0752354    .1639574
             |
        race |
      Black  |   .2207398   .5415324     0.41   0.686    -.8837228    1.325202
      Other  |   2.501477   .6638524     3.77   0.001     1.147541    3.855413
             |
       _cons |   .1987249   .0612995     3.24   0.003     .0737038     .323746
------------------------------------------------------------------------------

Let’s add an interaction term to the model. As you will notice in the output of the contrast command, the multi-degree-of-freedom test of the interaction term cannot be computed because we have run out of degrees of freedom. From the graph of the interaction, though, it seems safe to conclude that the interaction is not statistically significant.
svy: mlogit region i.female##i.race age, base(1)
(running mlogit on estimation sample)

Survey: Multinomial logistic regression

Number of strata =  31                   Number of obs   =      10,337
Number of PSUs   =  62                   Population size = 117,023,659
                                         Design df       =          31
                                         F(18, 14)       =        4.83
                                         Prob > F        =      0.0023

-------------------------------------------------------------------------------
              |             Linearized
       region | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
NE            |  (base outcome)
--------------+----------------------------------------------------------------
MW            |
       female |
      Female  |   .0519154    .058866     0.88   0.385    -.0681426    .1719734
              |
         race |
       Black  |   .8044695   .3580839     2.25   0.032     .0741525    1.534787
       Other  |   .5287782   .5443612     0.97   0.339    -.5814539     1.63901
              |
  female#race |
Female#Black  |  -.0160267   .2435719    -0.07   0.948    -.5127948    .4807415
Female#Other  |  -.3601052   .6504467    -0.55   0.584      -1.6867    .9664896
              |
          age |  -.0060579   .0029424    -2.06   0.048     -.012059   -.0000567
        _cons |   .3609425   .1459742     2.47   0.019     .0632263    .6586588
--------------+----------------------------------------------------------------
S             |
       female |
      Female  |   .0644129   .0570283     1.13   0.267    -.0518972     .180723
              |
         race |
       Black  |   1.502003    .392423     3.83   0.001     .7016507    2.302355
       Other  |   .3834883   .5361527     0.72   0.480    -.7100024    1.476979
              |
  female#race |
Female#Black  |  -.0199941   .2141128    -0.09   0.926      -.45668    .4166919
Female#Other  |  -.3231053   .6494598    -0.50   0.622    -1.647687    1.001477
              |
          age |   .0006218   .0033746     0.18   0.855    -.0062606    .0075042
        _cons |   .0437093   .1428439     0.31   0.762    -.2476229    .3350414
--------------+----------------------------------------------------------------
W             |
       female |
      Female  |    .080238    .047257     1.70   0.100    -.0161433    .1766193
              |
         race |
       Black  |   .4298017   .6454891     0.67   0.510    -.8866821    1.746285
       Other  |   2.769744    .757099     3.66   0.001     1.225631    4.313858
              |
  female#race |
Female#Black  |  -.4412226   .3404096    -1.30   0.204    -1.135493    .2530474
Female#Other  |  -.5391986   .6086492    -0.89   0.382    -1.780547    .7021497
              |
          age |  -.0064322   .0031729    -2.03   0.051    -.0129034     .000039
        _cons |   .4534141   .1291857     3.51   0.001     .1899381    .7168901
-------------------------------------------------------------------------------

contrast female#race

Contrasts of marginal linear predictions        Design df = 31

Margins: asbalanced

------------------------------------------------
             |         df           F        P>F
-------------+----------------------------------
NE           |
 female#race |  (not testable)
             |
      Design |         31
------------------------------------------------
Note: F statistics are adjusted for the survey design.

margins female#race, vce(unconditional)

Adjusted predictions

Number of strata =  31                   Number of obs   =      10,337
Number of PSUs   =  62                   Population size = 117,023,659
                                         Design df       =          31

1._predict: Pr(region==NE), predict(pr outcome(1))
2._predict: Pr(region==MW), predict(pr outcome(2))
3._predict: Pr(region==S),  predict(pr outcome(3))
4._predict: Pr(region==W),  predict(pr outcome(4))

--------------------------------------------------------------------------------------
                     |            Linearized
                     |     Margin   std. err.      t    P>|t|     [95% conf. interval]
---------------------+----------------------------------------------------------------
_predict#female#race |
      1#Male#White   |   .2280473   .0089673    25.43   0.000     .2097584    .2463362
      1#Male#Black   |   .0981057   .0289711     3.39   0.002     .0390188    .1571926
      1#Male#Other   |   .0420797    .026005     1.62   0.116    -.0109579    .0951173
    1#Female#White   |   .2171943   .0084349    25.75   0.000     .1999912    .2343974
    1#Female#Black   |   .1005365   .0236472     4.25   0.000     .0523078    .1487652
    1#Female#Other   |   .0633072   .0378127     1.67   0.104    -.0138123    .1404267
      2#Male#White   |   .2534941   .0097424    26.02   0.000     .2336244    .2733639
      2#Male#Black   |   .2465516   .0471339     5.23   0.000     .1504215    .3426818
      2#Male#Other   |    .079911   .0484013     1.65   0.109     -.018804     .178626
    2#Female#White   |   .2530577   .0106598    23.74   0.000     .2313169    .2747984
    2#Female#Black   |   .2625757   .0512328     5.13   0.000     .1580857    .3670657
    2#Female#Other   |   .0883006   .0403342     2.19   0.036     .0060386    .1705627
      3#Male#White   |   .2446779   .0163735    14.94   0.000      .211284    .2780719
      3#Male#Black   |   .4721417   .0721362     6.55   0.000      .325019    .6192644
      3#Male#Other   |   .0662021   .0385262     1.72   0.096    -.0123727    .1447769
    3#Female#White   |   .2486682   .0181709    13.68   0.000     .2116084     .285728
    3#Female#Black   |   .5056737   .0641249     7.89   0.000     .3748902    .6364572
    3#Female#Other   |   .0769021   .0323762     2.38   0.024     .0108703    .1429339
      4#Male#White   |   .2737806   .0121776    22.48   0.000     .2489441    .2986171
      4#Male#Black   |   .1832009   .0865927     2.12   0.043      .006594    .3598078
      4#Male#Other   |   .8118072   .0994732     8.16   0.000     .6089302    1.014684
    4#Female#White   |   .2810799   .0151421    18.56   0.000     .2501973    .3119624
    4#Female#Black   |   .1312141   .0443014     2.96   0.006     .0408609    .2215674
    4#Female#Other   |   .7714901   .0855077     9.02   0.000      .597096    .9458841
--------------------------------------------------------------------------------------

marginsplot

Variables that uniquely identify margins: female race

Running out of degrees of freedom is not an uncommon problem in multinomial models run on complex survey data.
In the example above, we were not able to calculate the multi-degree-of-freedom test of the interaction. In other situations, when we include the interaction term, we do not get an F-statistic or a p-value for the overall model.
Again, the issue is caused by the degrees of freedom. You might think that with so many observations in this data set we could run a model with many predictors and not run out of degrees of freedom, but that is not the case. A good explanation of the problem can be found in the Stata j_robustsingular help file, which you can access by clicking on the blue link in the Stata output or by typing help j_robustsingular in the Stata command window.
In some analyses, the F-statistic for the overall model may not be calculated; the parameters are still estimated correctly, but the standard errors should be considered carefully. The theory justifying the calculation of the standard errors assumes that the number of clusters minus the number of strata (the design degrees of freedom) is large. If the standard errors are suspect, then so are the test statistics and their p-values.
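After any svy estimation command, the design degrees of freedom can be checked from the stored results; a minimal sketch, reusing the ordinal model from earlier in this seminar:

```stata
* e(N_psu) and e(N_strata) are stored by svy estimation commands;
* the design degrees of freedom are their difference.
svy: ologit health i.female i.region age
display e(N_psu) - e(N_strata)   // 62 - 31 = 31 in this data set
```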
Count Models
Both Poisson and negative binomial models can be used with complex survey data. We will start with a simple Poisson model with a single binary predictor and then investigate the plausibility of the assumption that the conditional variance is approximately equal to the conditional mean. For more information regarding Poisson models, please see Stata Data Analysis Examples: Poisson Regression and Stata Annotated Output: Poisson Regression. For more information regarding negative binomial models, please see Stata Data Analysis Examples: Negative Binomial Regression and Stata Annotated Output: Negative Binomial Regression.
svy: poisson hdresult i.female
(running poisson on estimation sample)

Survey: Poisson regression

Number of strata =  31                   Number of obs   =       8,708
Number of PSUs   =  60                   Population size =  98,606,432
                                         Design df       =          29
                                         F(1, 29)        =      506.37
                                         Prob > F        =      0.0000

------------------------------------------------------------------------------
             |             Linearized
    hdresult | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |
     Female  |   .1713176   .0076132    22.50   0.000     .1557469    .1868883
       _cons |   3.810654   .0085857   443.84   0.000     3.793094    3.828214
------------------------------------------------------------------------------
Note: Strata with single sampling unit centered at overall mean.

quietly svy: mean hdresult, over(female)
estat sd, var

-------------------------------------
        Over |       Mean   Variance
-------------+-----------------------
c.hdresult@  |
      female |
       Male  |   45.17997   153.0132
     Female  |   53.62264   215.3385
-------------------------------------

It seems that we have overdispersion, which is quite common in count models. Let’s try a negative binomial model. Notice that no test of the alpha parameter being equal to 0 is given, as it is when this analysis is run with unweighted data.
svy: nbreg hdresult i.female
(running nbreg on estimation sample)

Survey: Negative binomial regression

Number of strata =  31                   Number of obs   =       8,708
Number of PSUs   =  60                   Population size =  98,606,432
Dispersion: mean                         Design df       =          29
                                         F(1, 29)        =      506.37
                                         Prob > F        =      0.0000

------------------------------------------------------------------------------
             |             Linearized
    hdresult | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |
     Female  |   .1713176   .0076132    22.50   0.000     .1557469    .1868883
       _cons |   3.810654   .0085857   443.84   0.000     3.793094    3.828214
-------------+----------------------------------------------------------------
    /lnalpha |  -3.008826   .0276271                      -3.06533   -2.952322
-------------+----------------------------------------------------------------
       alpha |   .0493496   .0013634                      .0466385    .0522183
------------------------------------------------------------------------------
Note: Strata with single sampling unit centered at overall mean.

margins female, vce(unconditional)

Adjusted predictions

Number of strata =  31                   Number of obs   =       8,708
Number of PSUs   =  60                   Population size =  98,606,432
                                         Design df       =          29

Expression: Predicted number of events, predict()

------------------------------------------------------------------------------
             |            Linearized
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |
       Male  |   45.17997   .3879036   116.47   0.000     44.38662    45.97332
     Female  |   53.62264   .4392396   122.08   0.000      52.7243    54.52099
------------------------------------------------------------------------------

Let’s try including an interaction between female and age.
svy: nbreg hdresult i.female##c.age
(running nbreg on estimation sample)

Survey: Negative binomial regression

Number of strata =  31                   Number of obs   =       8,708
Number of PSUs   =  60                   Population size =  98,606,432
Dispersion: mean                         Design df       =          29
                                         F(3, 27)        =      199.91
                                         Prob > F        =      0.0000

------------------------------------------------------------------------------
             |             Linearized
    hdresult | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |
     Female  |    .131853   .0185006     7.13   0.000      .094015     .169691
         age |  -.0000729   .0003018    -0.24   0.811    -.0006902    .0005445
             |
female#c.age |
     Female  |   .0009366   .0003886     2.41   0.022      .000142    .0017313
             |
       _cons |   3.813683   .0154928   246.16   0.000     3.781996    3.845369
-------------+----------------------------------------------------------------
    /lnalpha |   -3.01082   .0273982                     -3.066855   -2.954784
-------------+----------------------------------------------------------------
       alpha |   .0492513   .0013494                      .0465674    .0520899
------------------------------------------------------------------------------
Note: Strata with single sampling unit centered at overall mean.

We can use the margins command to get the predicted counts.
margins female, at(age=(20(10)60)) vce(unconditional)

Adjusted predictions

Number of strata =  31                   Number of obs   =       8,708
Number of PSUs   =  60                   Population size =  98,606,432
                                         Design df       =          29

Expression: Predicted number of events, predict()
1._at: age = 20
2._at: age = 30
3._at: age = 40
4._at: age = 50
5._at: age = 60

------------------------------------------------------------------------------
             |            Linearized
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
  _at#female |
    1#Male   |   45.25103   .4971356    91.02   0.000     44.23427    46.26779
    1#Female |     52.605   .5194601   101.27   0.000     51.54258    53.66741
    2#Male   |   45.21807   .4250519   106.38   0.000     44.34874     46.0874
    2#Female |   53.06136   .4554657   116.50   0.000     52.12983    53.99289
    3#Male   |   45.18514   .3893427   116.05   0.000     44.38884    45.98143
    3#Female |   53.52167   .4379633   122.21   0.000     52.62594    54.41741
    4#Male   |   45.15223    .399671   112.97   0.000     44.33481    45.96965
    4#Female |   53.98598   .4747944   113.70   0.000     53.01492    54.95705
    5#Male   |   45.11934   .4527185    99.66   0.000     44.19343    46.04525
    5#Female |   54.45432   .5575775    97.66   0.000     53.31395     55.5947
------------------------------------------------------------------------------

marginsplot

For more information on using the NHANES data sets
There are helpful resources for learning how to analyze the NHANES data sets correctly. One is a listserv at http://www.cdc.gov/nchs/nhanes/nhanes_listserv.htm. There are also online tutorials at http://www.cdc.gov/nchs/tutorials/index.htm.
References
Applied Survey Data Analysis, Second Edition by Steven G. Heeringa, Brady T. West, and Patricia A. Berglund
A survey of survey statistics: What is done and can be done in Stata by Frauke Kreuter and Richard Valliant, The Stata Journal (2007), Volume 7, Number 1, pages 1-21.
Goodness-of-fit test for a logistic regression model fitted using survey sample data by Kellie J. Archer and Stanley Lemeshow, The Stata Journal (2006), Volume 6, Number 1, pages 97-105.
Analysis of Health Surveys by Edward L. Korn and Barry I. Graubard
Sampling of Populations: Methods and Applications, Fourth Edition by Paul Levy and Stanley Lemeshow
Analysis of Survey Data Edited by R. L. Chambers and C. J. Skinner
Sampling Techniques, Third Edition by William G. Cochran
Stata 17 Manual: Survey Data
StatNews #73: Overlapping Confidence Intervals and Statistical Significance, October 2008 by the Cornell Statistical Consulting Unit
Why Overlapping Confidence Intervals Mean Nothing about Statistical Significance