Survey Data Analysis in Stata

The purpose of this seminar is to explore how to analyze survey data collected under different sampling plans using Stata. Other examples, including those using other survey data analysis packages, can be found at Choosing the Correct Analysis for Various Survey Designs. Before we begin looking at examples in Stata, we will quickly review some basic issues and concepts in survey data analysis.

Why do we need survey data analysis software?

Regular statistical software (that is not designed for survey data) analyzes data as if the data were collected using simple random sampling. For experimental and quasi-experimental designs, this is exactly what we want. However, when surveys are conducted, a simple random sample is rarely collected. Not only is it nearly impossible to do so, but it is not as efficient (both financially and statistically) as other sampling methods. When any sampling method other than simple random sampling is used, we usually need to use survey data analysis software to take into account the differences between the design that was used and simple random sampling. This is because the sampling design affects the calculation of the standard errors of the estimates. If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, the standard errors will likely be underestimated, possibly leading to results that seem to be statistically significant, when in fact, they are not. The difference in point estimates and standard errors obtained using non-survey software and survey software with the design properly specified will vary from data set to data set, and even between variables within the same data set. While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be.

Sampling designs

Most people do not conduct their own surveys. Rather, they use survey data that some agency or company collected and made available to the public. The documentation must be read carefully to find out what kind of sampling design was used to collect the data. This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs. Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.

Below are some common features of many sampling designs.

Weights: There are many types of weights that can be associated with a survey. Perhaps the most common is the probability weight, called a pweight in Stata, which is used to denote the inverse of the probability of being included in the sample due to the sampling design (except for a certainty PSU, see below). The probability weight is calculated as N/n, where N = the number of elements in the population and n = the number of elements in the sample. For example, if a population has 10 elements and 3 are sampled at random with replacement, then the pweight would be 10/3 = 3.33. In a two-stage design, the probability weight is calculated as f₁f₂, where f₁ is the inverse of the sampling fraction for the first stage and f₂ is the inverse of the sampling fraction for the second stage. Under many sampling plans, the sum of the probability weights will be a good estimate of the population size.

PSU: This is the primary sampling unit. This is the first unit that is sampled in the design. For example, school districts from California may be sampled and then schools within districts may be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU. One does not need to use the same sampling method at all levels of sampling. For example, probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts). In the case of a simple random sample, the PSUs and the elementary units are the same.

Strata: Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race or SES. Once these groups have been defined, one samples from each group as if it were independent of all of the other groups. For example, if a sample is to be stratified on gender, men and women would be sampled independently of one another. This means that the pweights for men will likely be different from the pweights for women. In most cases, you need to have two or more PSUs in each stratum. The purpose of stratification is to improve the precision of the estimates, and stratification works most effectively when the variance of the dependent variable is smaller within the strata than in the sample as a whole.

FPC: This is the finite population correction. This is used when the sampling fraction (the number of elements or respondents sampled relative to the population) becomes large. The FPC is used in the calculation of the standard error of the estimate. If the value of the FPC is close to 1, it will have little impact and can be safely ignored. In some survey data analysis programs, such as SUDAAN, this information will be needed if you specify that the data were collected without replacement (see below for a definition of “without replacement”). The formula for calculating the FPC is ((N-n)/(N-1))^(1/2), that is, the square root of (N-n)/(N-1), where N is the number of elements in the population and n is the number of elements in the sample. To see the impact of the FPC for samples of various proportions, suppose that you had a population of 10,000 elements.

Sample size (n)    FPC
1                1.0000
10                .9995 
100               .9950
500               .9747
1000              .9487
5000              .7071
9000              .3162
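The table above can be reproduced with a few lines of arithmetic. The following is a minimal Python sketch (not Stata code); the function name fpc and the constant POPULATION are illustrative names chosen here, and the formula is the one given in the FPC definition.

```python
import math

def fpc(pop_size: int, sample_size: int) -> float:
    """Finite population correction: sqrt((N - n) / (N - 1))."""
    return math.sqrt((pop_size - sample_size) / (pop_size - 1))

POPULATION = 10_000  # population size used in the table above

# Print one row per sample size, rounded to four decimals like the table.
for n in (1, 10, 100, 500, 1000, 5000, 9000):
    print(f"{n:>6}  {fpc(POPULATION, n):.4f}")
```

The correction stays close to 1 until the sample is a sizeable share of the population, which is why it is often ignored for small sampling fractions.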

Sampling with and without replacement

Most samples collected in the real world are collected “without replacement”. This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample. Many of the calculations change depending on whether a sample is collected with or without replacement. Hence, programs like SUDAAN request that you specify whether a survey sampling design was implemented with or without replacement, and an FPC is used if sampling without replacement is used, even if the value of the FPC is very close to one.

Examples

In the examples that follow, we have data that represent a population, and we will discuss the analysis of these survey data as if they had been collected under five sampling plans: simple random sampling, stratified random sampling, systematic sampling, one-stage cluster sampling and two-stage cluster sampling with stratification. The Stata code necessary to generate the samples using each of these sampling plans is shown here. The variables from the data set with which we will be working include api00 and api99, which are aggregates of student test scores for each school for the years 2000 and 1999, respectively; yr_rnd, which is a 0/1 variable indicating if the school is on a year-round calendar; awards, which indicates whether or not the school met its target; meals, which indicates the percentage of children receiving free or reduced-price meals at school; both, which indicates that the school met both targets; and growth, which is the difference between the api scores in the current year and those of the last year.

One of the most important points to remember is that all svy commands can be used with any sampling plan. To help illustrate this, we will use the svymean and the svytotal commands with each sampling plan. Another important point is that the interpretation of the results from the svy commands is usually no different than the interpretation that you would have if you had used the equivalent non-survey command. For example, there is no special interpretation of regression coefficients just because you obtained them using svyreg instead of regress.

Simple random sample

We will start by showing how you can take a simple random sample (SRS) from your data file. While we will not go through the commands necessary for obtaining any other type of sample, we will go over how to draw an SRS. Simple random samples are very rare in actual practice; however, researchers will often draw an SRS of their data set so that they can work out their data analysis programs on a relatively small data file. This saves computing time and resources, as the analysis program may have to be run many times before it is satisfactory.

set mem 5m
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
count

 6194

set seed 1003002849
sample 5
count

 310

Because we have eliminated elements of our population to create our sample, we need to create pweights (probability weights). We selected 5% of the elements in our population into our sample, so our sampling fraction is 310/6194. The pweight is the inverse of the sampling fraction, or N/n, where N is the population total (6194) and n is the number of elements selected into the sample (310). Another way to think of this is: “How many elements (schools, people, whatever) in the population should each element in the sample represent?” Each school in our current sample should represent about twenty schools in the population, so all of the pweights will be the same: approximately 20.

gen pw = 6194/310
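The weight arithmetic used here, together with the two-stage rule from the Weights definition, can be checked outside Stata. This is a minimal Python sketch; the helper name pweight is illustrative, and the two-stage counts are hypothetical values chosen only to show the multiplication, not figures from this handout.

```python
def pweight(pop_size: float, sample_size: float) -> float:
    """Inverse of the sampling fraction n/N for an equal-probability sample."""
    return pop_size / sample_size

# Example from the Weights definition: 3 of 10 elements sampled.
print(round(pweight(10, 3), 2))   # 3.33

# The 5% sample drawn above: 310 of 6194 schools.
pw = pweight(6194, 310)
print(round(pw, 2))               # 19.98
print(round(pw * 310))            # the weights sum back to 6194

# Two-stage design with HYPOTHETICAL counts (not from the handout):
# 20 of 100 districts, then 5 of 25 schools in each selected district.
two_stage = pweight(100, 20) * pweight(25, 5)
print(two_stage)                  # 25.0
```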

Next, we need to consider how large our sample is relative to our population to determine if we need to use a finite population correction. (For a quick review of FPCs, please see the summary at the beginning of this handout.) In Stata, we only need to give the population total, and Stata will make the necessary calculations to obtain the correct FPC. Note that the svyset command is very different in Stata 8 than it was in Stata 7.

gen fpc = 6194

We use the svyset command to tell Stata about the features of the sampling design that we have. In this case, we only need to specify the pweight and the FPC.

svyset [pweight=pw], fpc(fpc)

Next, we will use the svydes command to display the information Stata has regarding our sampling plan. As you can see, the number of PSUs and observations is the same, which reassures us that Stata understands that we have a simple random sample. We also see that there is only one stratum, which is correct for this type of sampling plan. Note that once you have used the svyset command, Stata will remember this information for your entire session; you do not need to reissue this command (unless you want to change something). Also, if you save your data, Stata will save the survey information with the data set, so that when you open the data in your next session, the survey information will be used when you issue svy commands.

svydes

Strata:   <one>
PSU:      <observations>
FPC:      fpc
                                      #Obs per PSU
 Strata                       ----------------------------
  <one>     #PSUs     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1

We will start our analysis of these data with some basic descriptive statistics. We will use the svymean and svytotal commands. The svymean command is used to estimate the mean of a variable in the population. In our example, we will estimate the mean for api00 and growth.

svymean api00 growth

Survey mean estimation

pweight:  pw                                      Number of obs    =       310
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       310
FPC:      fpc                                     Population size  = 6193.9997

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
   api00 |   663.2645     7.21478    649.0682    677.4608           1
  growth |   33.84516    1.667394    30.56428    37.12604           1
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.

The svytotal command is used to get estimates of population totals. In our example, we will get an estimate of how many schools are on a year-round calendar. From the output of the svytotal command, we can see that approximately 719 schools are on a year-round calendar.

svytotal yr_rnd

Survey total estimation

pweight:  pw                                      Number of obs    =       310
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       310
FPC:      fpc                                     Population size  = 6193.9997

------------------------------------------------------------------------------
   Total |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
  yr_rnd |   719.3032    110.0291    502.8022    935.8042           1
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
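To see where the numbers in this output come from, the estimate and its standard error can be rebuilt by hand under simple random sampling without replacement. The Python sketch below is an illustration, not Stata's implementation. It assumes that 36 of the 310 sampled schools are year-round; that count is not stated in the handout and is inferred here from 719.3032 divided by the pweight of 6194/310.

```python
import math

N = 6194          # schools in the population (from the handout)
n = 310           # schools in the sample (from the handout)
year_round = 36   # ASSUMED sample count, inferred from the estimate above

pw = N / n
total = pw * year_round                    # weighted sum of the 0/1 variable

p = year_round / n                         # sample proportion of year-round schools
s2 = p * (1 - p) * n / (n - 1)             # sample variance of a 0/1 variable
se = N * math.sqrt((1 - n / N) * s2 / n)   # SRS without replacement, with FPC

print(round(total, 4), round(se, 2))       # 719.3032 110.03
```

Under that assumption the sketch agrees with the reported estimate of 719.3032 and standard error of 110.0291, which suggests the weighted total is simply the pweight times the number of year-round schools in the sample.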

We will now do a multiple regression. We will use api00 as the dependent variable and awards and meals as independent variables. We can see from the output that the model is statistically significant (F = 464.21, p < .0001), and that each of the predictors is also statistically significant. You can interpret the output from the svy commands in the same way that you would the non-svy command. In this example, you interpret the output from the svyreg command in the same way that you would the output from the regress command. Remember that the difference between the svyreg and the regress commands is how the standard errors are calculated. The svyreg command takes into account the survey sampling plan, while the regress command does not.

svyreg api00 awards meals

Survey linear regression

pweight:  pw                                      Number of obs    =       310
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       310
FPC:      fpc                                     Population size  = 6193.9997
                                                  F(   2,    308)  =    464.21
                                                  Prob > F         =    0.0000
                                                  R-squared        =    0.7124

------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      awards |   53.37164   9.047051     5.90   0.000     35.57002    71.17325
       meals |  -3.329605   .1285952   -25.89   0.000    -3.582638   -3.076571
       _cons |   792.2924   11.52925    68.72   0.000     769.6066    814.9781
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.

Stratified random sampling

The difference between the example above and the next example is that stratification has been added to the sampling design. For this example, we have calculated the mean of api99 and stratified schools based on this. Schools that were above the mean were placed into one stratum, and schools that were below the mean were placed in the other stratum. Simple random samples of schools were then drawn from each stratum. Although we have created only two strata, in many public-use data sets, you can have dozens of strata.

We have used the svyset, clear(all) command here to show how it is used. After issuing the svyset command, we again use the svydes command to ensure that Stata is handling the survey design appropriately. Next, we use the svymean command to obtain the estimated means of api00 and growth. We can compare these estimates to those obtained from the SRS above. (Please see the table at the end of this handout.)

use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strsrs, clear

svyset, clear(all)
svyset [pweight = pw], strata(strat) fpc(fpc)
svydes

pweight:  pw
Strata:   strat
PSU:      <observations>
FPC:      fpc
                                      #Obs per PSU
 Strata                       ----------------------------
  strat     #PSUs     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1
       2       310       310         1       1.0         1
--------  --------  --------  --------  --------  --------
       2       620       620         1       1.0         1

Below we use the svymean command to get the population estimate of the mean of api00. Notice the value of the design effect, labeled Deff in the output (on the far right of the table). The design effect compares the current sampling design (in this case, stratified random sampling) with simple random sampling. Design effects of 1 (or close to 1) indicate that the current sampling design is about as efficient as a simple random sample. Design effects that are smaller than 1 indicate that the current design is more efficient than simple random sampling, while design effects that are larger than 1 indicate that the current sampling design is less efficient than simple random sampling. Here, we can see the benefit of the stratification: the design effect for api00 is .35, well below 1. Keep in mind, however, that we stratified on api99 (split at its mean), which is closely related to api00, the variable for which we are getting an estimate, so a gain of this size is to be expected.

svymean api00 growth

Survey mean estimation

pweight:  pw                                      Number of obs    =       620
Strata:   strat                                   Number of strata =         2
PSU:      <observations>                          Number of PSUs   =       620
FPC:      fpc                                     Population size  = 6193.9997

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
   api00 |   665.6216    2.957053    659.8145    671.4287    .3468751
  growth |   33.26666     1.10437    31.09788    35.43543    .9629825
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
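Because the design effect is a ratio of variances, it can be turned around to show what standard error a simple random sample of the same size would have given. The Python sketch below uses only the two api00 figures from the output above; the variable names are illustrative, and the result is an implied value, not something Stata reports.

```python
import math

# Values read from the svymean output above for api00.
se_design = 2.957053   # standard error under the stratified design
deff = 0.3468751       # reported design effect

# deff = Var(design) / Var(SRS), so the implied SRS standard error is:
se_srs = se_design / math.sqrt(deff)

print(round(se_srs, 4))   # 5.0208
```

In other words, stratifying on api99 cut the standard error of the api00 mean from roughly 5.0 to roughly 3.0 for the same number of schools.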

In the results of the svytotal command shown below, you will see that the design effect is not much smaller than 1; in other words, we get relatively little benefit from the stratification. That is because there is not much of a relationship between api99 and yr_rnd. The point here is that for stratification to be genuinely useful, you need to stratify on variable(s) closely related to the variable of interest. In many cases, this will mean that while stratification will make some estimates more efficient, it will not do so for others.

svytotal yr_rnd

Survey total estimation

pweight:  pw                                      Number of obs    =       620
Strata:   strat                                   Number of strata =         2
PSU:      <observations>                          Number of PSUs   =       620
FPC:      fpc                                     Population size  = 6193.9997

------------------------------------------------------------------------------
   Total |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
  yr_rnd |   789.5516    76.59607    639.1314    939.9717    .9457493
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.

When estimates are made for each stratum, they are made independently of all other strata. In other words, the estimate of yr_rnd for stratum 1 was made independently of the estimate for stratum 2. Also note that the sum of the estimates for stratum 1 and stratum 2 equals the value shown above.

svytotal yr_rnd, by(strat)

Survey total estimation

pweight:  pw                                      Number of obs    =       620
Strata:   strat                                   Number of strata =         2
PSU:      <observations>                          Number of PSUs   =       620
FPC:      fpc                                     Population size  = 6193.9997

------------------------------------------------------------------------------
Total  Subpop. |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------------+--------------------------------------------------------------
yr_rnd         |
      strat==1 |   639.7935    67.69423    506.8549    772.7321    .8870259
      strat==2 |   149.7581    35.83923    79.37663    220.1395     .976068
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
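The independence of the strata can be checked numerically: the stratum totals add up to the overall total, and because the strata are independent their variances add as well. The Python sketch below uses only the figures printed in the two svytotal outputs above; the dictionary names are illustrative.

```python
import math

# Stratum-level estimates of the yr_rnd total, from the by(strat) output above.
totals = {1: 639.7935, 2: 149.7581}
std_errs = {1: 67.69423, 2: 35.83923}

overall_total = sum(totals.values())
# Independent strata: variances add, so standard errors combine in quadrature.
overall_se = math.sqrt(sum(se ** 2 for se in std_errs.values()))

print(round(overall_total, 4), round(overall_se, 3))   # 789.5516 76.596
```

Both values agree with the overall svytotal output (789.5516 and 76.59607).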

Systematic sampling

Systematic sampling is just that: drawing a sample from elements that are ordered in a systematic way. For example, you might take a systematic sample of library books by selecting every k-th book from the books on the shelf. (Remember that librarians hate when people actually do this!) Of course, first you need to determine how large a sample you want to select. There are 6194 schools in our population, and we would like to use systematic sampling to select a sample of about 500. First, we need to determine the “rate” at which schools should be selected. We do this by dividing the number of elements (e.g., schools) by the number desired in the sample. Therefore, k = 6194/500 = 12.39, which we will round up to 13. Hence, we will select every 13th school. We will also randomly select a number from 1 to 13 and start counting from there. In our example, we selected the number 4. Hence, we ordered the schools from lowest id number to highest id number, started with school number 4, and then selected into our sample every 13th school. After creating our sample, we follow the same procedure as before: open the correct data file, issue the svyset command, check to see that everything is OK with the svydes command, and then begin our analysis with descriptive statistics.

use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/systematic.dta, clear
svyset [pweight = pw], fpc(fpc)
svydes

pweight:  pw
Strata:   <one>
PSU:      <observations>
FPC:      fpc
                                      #Obs per PSU
 Strata                       ----------------------------
  <one>     #PSUs     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       477       477         1       1.0         1
--------  --------  --------  --------  --------  --------
       1       477       477         1       1.0         1
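The 477 observations reported by svydes follow directly from the selection rule described above (every 13th school, starting from school 4). The Python sketch below only reproduces the positions in the id-ordered list; it is not the code that was used to build the data file, and the constant names are illustrative.

```python
import math

POPULATION = 6194   # schools in the population
TARGET = 500        # desired sample size
START = 4           # random start used in the handout

k = math.ceil(POPULATION / TARGET)                  # 12.388 rounded up to 13
positions = list(range(START, POPULATION + 1, k))   # 4, 17, 30, ...

print(k, len(positions), positions[:3], positions[-1])
# 13 477 [4, 17, 30] 6192
```

Rounding the interval up rather than down is why the realized sample (477) is somewhat smaller than the target of 500.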

Below we get the population estimates for the mean of api00 and growth.

svymean api00 growth

Survey mean estimation

pweight:  pw                                      Number of obs    =       477
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       477
FPC:      fpc                                     Population size  =      6194

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
   api00 |   656.3061    5.655353    645.1935    667.4186           1
  growth |   33.08595    1.226588    30.67576    35.49615           1
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.

Notice that the design effect for all variables is 1. This is not necessarily because systematic sampling is always just as efficient as simple random sampling. Rather, it has to do with the information that you have given to Stata. The design effect is influenced by setting the strata and PSU. In both simple random sampling and systematic sampling, we set neither the strata nor the PSU. Hence, Stata “can’t tell the two sampling plans apart.” Because the specification of the sampling design is exactly the same as with simple random sampling, the design effect is 1. However, you can calculate the design effect by hand by dividing the variance of the estimate for the variable of interest under the current sampling design by the variance of the same estimate under simple random sampling. We did this and found that the design effects were very close to 1. We found them to be .96 for api00, .93 for growth and 1.2 for yr_rnd.

svytotal yr_rnd

Survey total estimation

pweight:  pw                                      Number of obs    =       477
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       477
FPC:      fpc                                     Population size  =      6194

------------------------------------------------------------------------------
   Total |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
  yr_rnd |   779.1195    90.44644    601.3958    956.8431           1
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.

Below we show the use of the svytab command, which makes two-way crosstabulations. Here we will make a crosstab of both and awards. The values in the cells are proportions. The svytab command also gives us the chi-square test for these two variables. We can see that the relationship between them is statistically significant.

svytab both awards

pweight:  pw                                    Number of obs      =       477
Strata:   <one>                                 Number of strata   =         1
PSU:      <observations>                        Number of PSUs     =       477
FPC:      fpc                                   Population size    =      6194

-------------------------------
met both  | eligible for awards
targets   |    no    yes  Total
----------+--------------------
       No | .3019      0  .3019
      Yes | .0503  .6478  .6981
          | 
    Total | .3522  .6478      1
-------------------------------
  Key:  cell proportions

  Pearson:
    Uncorrected   chi2(1)         =  379.3900
    Design-based  F(1, 476)       =  427.4673     P = 0.0000

Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.

Unlike the tabulate command, the svytab command requires two variables; in other words, it only makes two-way tables. If you want to make a one-way table, you need to create a constant variable and use it as one of the variables in the svytab command.

gen cons = 1
svytab both cons

pweight:  pw                                    Number of obs      =       477
Strata:   <one>                                 Number of strata   =         1
PSU:      <observations>                        Number of PSUs     =       477
FPC:      fpc                                   Population size    =      6194

------------------------
met both  |     cons    
targets   |     1  Total
----------+-------------
       No | .3019  .3019
      Yes | .6981  .6981
          | 
    Total |     1      1
------------------------
  Key:  cell proportions

  Only one column category.
  Statistics cannot be computed.

  Pearson:
    Uncorrected   chi2(0)         =         .
    Design-based  F(., .)         =         .     P =      .

Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.

svyreg api00 awards meals

Survey linear regression

pweight:  pw                                      Number of obs    =       477
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       477
FPC:      fpc                                     Population size  =      6194
                                                  F(   2,    475)  =    679.67
                                                  Prob > F         =    0.0000
                                                  R-squared        =    0.6967

------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      awards |   46.30969   7.237096     6.40   0.000     32.08908    60.53029
       meals |  -3.406531   .1056495   -32.24   0.000    -3.614128   -3.198934
       _cons |   791.0985   9.321325    84.87   0.000     772.7825    809.4146
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.

One-stage cluster sampling in Stata

In a one-stage cluster sample, the data are divided into two “levels”, one “nested” in the other. At the first level, the data are grouped into clusters. In a one-stage cluster sample, clusters are selected first and are called primary sampling units, or PSUs. All of the elements in each selected cluster are selected into the sample. These elements represent the second “level” of the data. In our one-stage cluster sample, the districts will be the clusters and the schools will be the elementary or sampling units. Hence, we randomly select school districts and then select all schools within each selected district. You can use any sampling plan to select the clusters; we have used SRS only for the sake of simplicity.

Typically, data values within a cluster are more similar to one another than they are to data values in other clusters. For example, if we surveyed people in households (i.e., people nested within households), we would expect people in the same household to be more similar to one another than to people in another household. Unfortunately, this similarity means that each additional observation from a cluster adds less new information, so standard errors are larger than they would be for a simple random sample of the same size. However, because of financial and/or logistical considerations, most surveys employ some sort of cluster sampling.
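
The cost of within-cluster similarity can be made concrete with a common approximation, deff = 1 + (m - 1)*rho, where m is the number of elements sampled per cluster and rho is the intraclass correlation. The approximation assumes equal cluster sizes and equal weights. The household size and correlation below are hypothetical values chosen only for illustration; they are not taken from the seminar data.

```stata
* Approximate design effect for a cluster sample: deff = 1 + (m - 1)*rho
* m and rho are hypothetical illustration values, not estimates from any data

* People interviewed per household (assumed)
scalar m = 4
* Intraclass correlation within households (assumed)
scalar rho = 0.5

scalar deff = 1 + (m - 1)*rho

* Displays 2.5
display "approximate deff = " deff
* Displays about 1.58: the factor by which the standard error is inflated
display "SE inflation (deft) = " sqrt(deff)
```

Under these assumed values, the clustered sample would have a variance 2.5 times that of a simple random sample of the same size.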

use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/oscs1, clear

svyset [pweight = pw], fpc(fpc) psu(dnum)
svydes

pweight:  pw
Strata:   <one>
PSU:      dnum
FPC:      fpc
                                      #Obs per PSU
 Strata                       ----------------------------
  <one>     #PSUs     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       189      1463         1       7.7       100
--------  --------  --------  --------  --------  --------
       1       189      1463         1       7.7       100

svymean api00 growth

Survey mean estimation

pweight:  pw                                      Number of obs    =      1463
Strata:   <one>                                   Number of strata =         1
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 5859.7407

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
   api00 |   670.5202    11.09702    648.6295    692.4108    15.27665
  growth |   32.85783    1.440905    30.01541    35.70025    4.554066
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.

svytotal yr_rnd

Survey total estimation

pweight:  pw                                      Number of obs    =      1463
Strata:   <one>                                   Number of strata =         1
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 5859.7407

------------------------------------------------------------------------------
   Total |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
  yr_rnd |   797.0529    176.0585    449.7489    1144.357     14.9672
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.

As you can see, the standard errors for api00 and yr_rnd are much larger than they were for any of the previous sampling plans. Although we don’t show a stratified one-stage example here, you can easily combine stratification with cluster sampling, and this will usually reduce the standard errors.
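
The svymean, svytotal, and svydes commands used in this section belong to older Stata releases. In Stata 9 and later, the same analysis would typically be written with the svy: prefix. The following is a sketch of that newer syntax, assuming the same oscs1 data and variable names as above. Its output is not shown here, and the layout will differ from the older output.

```stata
* Newer-syntax sketch of the one-stage cluster analysis (assumes Stata 9 or later)
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/oscs1, clear

* The PSU variable is given as the first argument rather than in psu()
svyset dnum [pweight = pw], fpc(fpc)
svydescribe

* Means, with design effects requested as a postestimation step
svy: mean api00 growth
estat effects, deff

* Estimated population total of year-round schools
svy: total yr_rnd
estat effects, deff
```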

Two-stage cluster sampling with stratification

In this last example, we will take a stratified two-stage cluster sample. As with the stratified random sample illustrated above, the sampling within each stratum is done independently of every other stratum. A two-stage cluster sample means that clusters are sampled (using whatever sampling plan the researcher chooses), and then elements within each of the selected clusters are also sampled. This differs from a one-stage cluster sample, in which all of the elements in each selected cluster are included in the sample; in a two-stage cluster sample, (usually) only some of the elements are selected. In our example, we will take an SRS of school districts (clusters), and then an SRS of schools (elements) within each selected district. Just as you can use pretty much any sampling plan to select clusters, you can use pretty much any sampling plan to select elements from within the selected clusters; the two plans do not have to be the same. Also, you do not have to use the same sampling plan from one stratum to the next, as the sampling between strata is independent.

To obtain the sample used below, we first used the stratification that we used before, stratifying schools based on their mean api99 score. Next, we randomly selected 25% of the school districts from each stratum. Finally, we randomly selected three schools from each selected district. The choice to select three schools, as opposed to two or four, was rather arbitrary. When deciding how many elements to select from a cluster, remember that you need enough to get stable estimates. On the other hand, because data values within each cluster are likely correlated, taking many of them is often a waste of resources: 200 elements probably won’t be much more informative than 100. (This, of course, depends on how strong the correlation is.)
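
Because each stage has its own selection probability, the probability weight for a sampled school in a design like this is usually the inverse of the product of the two probabilities. The sketch below works through that arithmetic for three made-up districts. The variable names nschools, p1, p2, and pwt_sketch are hypothetical, the 25% rate is treated as an exact selection probability, and this is not necessarily how the pwt variable in the seminar data was built.

```stata
* Hypothetical districts: dnum = district id, nschools = schools in the district
clear
input dnum nschools
1  3
2 10
3  1
end

* Stage 1: 25% of districts are selected within each stratum (assumed exact)
generate double p1 = 0.25

* Stage 2: three schools per selected district, or all of them if fewer than three
generate double p2 = min(3, nschools) / nschools

* Probability weight = inverse of the overall selection probability
generate double pwt_sketch = 1 / (p1 * p2)

list dnum nschools p1 p2 pwt_sketch
```

Under these assumptions, a school in a district with many schools represents more of the population than a school in a small district, which is why the weights vary across districts.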

use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strataboth, clear

svyset [pweight = pwt], fpc(fpc) psu(dnum) strata(strata)
svydes

pweight:  pwt
Strata:   strata
PSU:      dnum
FPC:      fpc
                                      #Obs per PSU
 Strata                       ----------------------------
 strata     #PSUs     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1        94       227         1       2.4         3
       2        95       239         1       2.5         3
--------  --------  --------  --------  --------  --------
       2       189       466         1       2.5         3

svymean api00 growth

Survey mean estimation

pweight:  pwt                                     Number of obs    =       466
Strata:   strata                                  Number of strata =         2
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 6032.9042

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
   api00 |     681.84    10.44856    661.2278    702.4522    3.902558
  growth |   30.71763     2.22572    26.32688    35.10838    3.009538
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.

svytotal yr_rnd

Survey total estimation

pweight:  pwt                                     Number of obs    =       466
Strata:   strata                                  Number of strata =         2
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 6032.9042

------------------------------------------------------------------------------
   Total |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
  yr_rnd |   718.9149    214.9205    294.9345    1142.895    6.092888
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.

svyreg api00 awards meals

Survey linear regression

pweight:  pwt                                     Number of obs    =       466
Strata:   strata                                  Number of strata =         2
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 6032.9042
                                                  F(   2,    186)  =    556.68
                                                  Prob > F         =    0.0000
                                                  R-squared        =    0.7114

------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      awards |   66.19885   5.867421    11.28   0.000       54.624    77.77369
       meals |  -3.192264   .1135934   -28.10   0.000    -3.416353   -2.968175
       _cons |   772.7654    6.72774   114.86   0.000     759.4934    786.0374
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.

We have seen examples of how to do OLS regression with survey data, so now let’s do a logistic regression. First, we need to recode our dependent variable so that it is coded 0/1. Next, we issue the svylogit command. Note that there is no “svylogistic” command. If you want odds ratios, you can use the or option with svylogit. In this example, we use some new variables. The variable comp_imp1 is coded 0/1 and indicates whether the school met a comparable improvement target; growth is the difference between the current year’s api score and last year’s api score; ell is the percent of English language learners; and mobility is the percent of students for whom this is their first year at the school.

svylogit comp_imp1 growth ell mobility

Survey logistic regression

pweight:  pwt                                     Number of obs    =       466
Strata:   strata                                  Number of strata =         2
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 6032.9042
                                                  F(   3,    185)  =     20.80
                                                  Prob > F         =    0.0000

------------------------------------------------------------------------------
   comp_imp1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      growth |   .1213203   .0159442     7.61   0.000     .0898667    .1527739
         ell |  -.0702944   .0119777    -5.87   0.000    -.0939231   -.0466657
    mobility |  -.0781154   .0202496    -3.86   0.000    -.1180624   -.0381684
       _cons |   .6391637   .3169899     2.02   0.045      .013828    1.264499
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.
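
Since the coefficients above are on the log-odds scale, exponentiating them gives the odds ratios that the or option would report. As a quick check that needs no data, the coefficients from the table can be exponentiated directly. The rounded values in the comments are approximate.

```stata
* Odds ratios are exp(coefficient); coefficients are copied from the output above

* growth: about 1.13
display exp(.1213203)

* ell: about 0.93
display exp(-.0702944)

* mobility: about 0.92
display exp(-.0781154)
```

An odds ratio of about 1.13 for growth means that each additional point of growth multiplies the estimated odds of meeting the improvement target by roughly 1.13, holding ell and mobility constant.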

Now we will use a three-level variable to show the use of the test command. Please note that “svytest” is an out-of-date command; use test instead. As the example below shows, the xi prefix works with the svy commands (and so does xi3).

xi: svylogit comp_imp1 growth ell mobility i.meals3

i.meals3          _Imeals3_1-3        (naturally coded; _Imeals3_1 omitted)

Survey logistic regression

pweight:  pwt                                     Number of obs    =       466
Strata:   strata                                  Number of strata =         2
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 6032.9042
                                                  F(   5,    183)  =     14.28
                                                  Prob > F         =    0.0000

------------------------------------------------------------------------------
   comp_imp1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      growth |   .1333139   .0177526     7.51   0.000     .0982928     .168335
         ell |  -.0335437   .0134298    -2.50   0.013    -.0600371   -.0070503
    mobility |  -.0528434   .0194839    -2.71   0.007      -.09128   -.0144068
  _Imeals3_2 |  -1.976366   .3789415    -5.22   0.000    -2.723916   -1.228817
  _Imeals3_3 |   -2.54474   .9051281    -2.81   0.005    -4.330314   -.7591659
       _cons |   .5236906   .2685344     1.95   0.053    -.0060555    1.053437
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without 
replacement of PSUs within each stratum with no subsampling within PSUs.

test  _Imeals3_2 _Imeals3_3

Adjusted Wald test

 ( 1)  _Imeals3_2 = 0
 ( 2)  _Imeals3_3 = 0

       F(  2,   186) =   15.94
            Prob > F =    0.0000
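
In Stata 11 and later, factor-variable notation makes the xi prefix unnecessary, and testparm can test all levels of meals3 jointly. The sketch below assumes that the strataboth data are in memory, that they have already been svyset as above, and that comp_imp1 has been created. The coefficient names will differ from the _Imeals3_ names shown in the older output.

```stata
* Factor-variable sketch (assumes Stata 11 or later and data that are already svyset)
svy: logit comp_imp1 growth ell mobility i.meals3

* Joint adjusted Wald test of the meals3 indicators
testparm i.meals3

* Denominator df of the adjusted Wald F: (PSUs - strata) - (terms tested) + 1
* Displays 186, matching F(2, 186) in the output above
display (189 - 2) - 2 + 1
```

The last line shows where the 186 comes from: the design has 189 PSUs in 2 strata, giving 187 design degrees of freedom, and the adjustment subtracts one less than the number of terms tested.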

Summary of population values, estimates, standard errors, design effects and estimated population totals for each sampling plan

The table below summarizes the values obtained from the descriptive statistics that we ran under each of the sampling plans, as well as the estimated population size. It also contains the population values, which, of course, are not estimates and hence have no standard errors or design effects associated with them. A design effect is the ratio of the estimated variance of a statistic under the current sampling design to its estimated variance under simple random sampling. In other words, it is an estimate of the efficiency of the current sampling design relative to simple random sampling. As you can see, the standard errors and the design effects for the stratified simple random sample are the smallest. The simple random sample and the systematic sample both show design effects of 1, and the design effects become much larger when cluster sampling is used. The largest design effects are obtained using one-stage cluster sampling. Also notice that cluster sampling yields estimates of the population size that differ considerably from those obtained using the other sampling plans. You should not assume that this pattern of results will be obtained every time these sampling plans are compared. Some plans that look relatively inefficient in this example may appear to be more efficient with other samples and/or other data.

	mean api00			mean growth			total yr_rnd			estimated population size
	estimate	standard error	design effect	estimate	standard error	design effect	estimate	standard error	design effect	estimated population size
population values	664.71	N/A	N/A	32.80	N/A	N/A	874	N/A	N/A	6194
SRS	663.26	7.21	1	33.85	1.67	1	719.30	110.03	1	6194
Stratified SRS	665.62	2.96	.35	33.27	1.10	.96	789.55	76.60	.95	6194
Systematic	656.31	5.66	1	33.09	1.23	1	779.12	90.45	1	6194
One-stage cluster	670.52	11.10	15.28	32.86	1.44	4.55	797.05	176.06	14.97	5860
Stratified two-stage cluster	681.84	10.45	3.90	30.72	2.23	3.01	718.91	214.92	6.09	6033
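
Because a design effect is a ratio of variances, its square root (often called deft) is the ratio of standard errors. As a rough illustration using the one-stage cluster results for api00 reported earlier, dividing the design-based standard error by deft recovers the standard error that the design effect implies for a simple random sample of the same size. The rounded values in the comments are approximate.

```stata
* api00 under one-stage cluster sampling: Std. Err. = 11.09702, Deff = 15.27665

* deft, the ratio of standard errors: about 3.91
display sqrt(15.27665)

* Standard error implied for an SRS of the same size: about 2.84
display 11.09702 / sqrt(15.27665)
```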