Survey Data Analysis in Stata
The purpose of this seminar is to explore how to analyze survey data collected under different sampling plans using Stata. Other examples, including those using other survey data analysis packages, can be found at Choosing the Correct Analysis for Various Survey Designs. Before we begin looking at examples in Stata, we will quickly review some basic issues and concepts in survey data analysis.
Why do we need survey data analysis software?
Regular statistical software (that is not designed for survey data) analyzes data as if the data were collected using simple random sampling. For experimental and quasi-experimental designs, this is exactly what we want. However, when surveys are conducted, a simple random sample is rarely collected. Not only is it nearly impossible to do so, but it is not as efficient (both financially and statistically) as other sampling methods. When any sampling method other than simple random sampling is used, we usually need to use survey data analysis software to take into account the differences between the design that was used and simple random sampling. This is because the sampling design affects the calculation of the standard errors of the estimates. If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, the standard errors will likely be underestimated, possibly leading to results that seem to be statistically significant, when in fact, they are not. The difference in point estimates and standard errors obtained using non-survey software and survey software with the design properly specified will vary from data set to data set, and even between variables within the same data set. While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be.
Sampling designs
Most people do not conduct their own surveys. Rather, they use survey data that some agency or company collected and made available to the public. The documentation must be read carefully to find out what kind of sampling design was used to collect the data. This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs. Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.
Below are some common features of many sampling designs.
Weights: There are many types of weights that can be associated with a survey. Perhaps the most common is the probability weight, called a pweight in Stata, which is used to denote the inverse of the probability of being included in the sample due to the sampling design (except for a certainty PSU, see below). The probability weight is calculated as N/n, where N = the number of elements in the population and n = the number of elements in the sample. For example, if a population has 10 elements and 3 are sampled at random with replacement, then the pweight would be 10/3 = 3.33. In a two-stage design, the probability weight is calculated as $f_1 f_2$, where $f_1$ is the inverse of the sampling fraction for the first stage and $f_2$ is the inverse of the sampling fraction for the second stage. Under many sampling plans, the sum of the probability weights will be a good estimate of the population size.
PSU: This is the primary sampling unit. This is the first unit that is sampled in the design. For example, school districts from California may be sampled and then schools within districts may be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU. One does not need to use the same sampling method at all levels of sampling. For example, probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts). In the case of a simple random sample, the PSUs and the elementary units are the same.
Strata: Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race or SES. Once these groups have been defined, one samples from each group as if it were independent of all of the other groups. For example, if a sample is to be stratified on gender, men and women would be sampled independently of one another. This means that the pweights for men will likely be different from the pweights for women. In most cases, you need to have two or more PSUs in each stratum. The purpose of stratification is to improve the precision of the estimates, and stratification works most effectively when the variance of the dependent variable is smaller within the strata than in the sample as a whole.
FPC: This is the finite population correction. This is used when the sampling fraction (the number of elements or respondents sampled relative to the population) becomes large. The FPC is used in the calculation of the standard error of the estimate. If the value of the FPC is close to 1, it will have little impact and can be safely ignored. In some survey data analysis programs, such as SUDAAN, this information will be needed if you specify that the data were collected without replacement (see below for a definition of “without replacement”). The formula for calculating the FPC is $\sqrt{(N-n)/(N-1)}$, where N is the number of elements in the population and n is the number of elements in the sample. To see the impact of the FPC for samples of various proportions, suppose that you had a population of 10,000 elements.
Sample size (n)       FPC
              1    1.0000
             10     .9995
            100     .9950
            500     .9747
           1000     .9487
           5000     .7071
           9000     .3162
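The weight and FPC arithmetic described above can be checked with Stata's display command. The following is a minimal sketch, not part of the original seminar; the two-stage figures (20 of 100 districts, then 3 of 12 schools) are hypothetical and chosen only for illustration.

```stata
* Probability weight for 3 elements sampled from a population of 10
display 10/3

* Two-stage weight: product of the inverse sampling fractions
* (hypothetical design: 20 of 100 districts, then 3 of 12 schools)
display (100/20) * (12/3)

* FPC = sqrt((N - n)/(N - 1)) for a population of N = 10,000
local N = 10000
foreach n of numlist 1 10 100 500 1000 5000 9000 {
    display %6.0f `n' "   " %6.4f sqrt((`N' - `n')/(`N' - 1))
}
```

The loop should reproduce the FPC column of the table above, one line per sample size.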
Sampling with and without replacement
Most samples collected in the real world are collected “without replacement”. This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample. Many of the calculations change depending on whether a sample is collected with or without replacement. Hence, programs like SUDAAN request that you specify whether a survey sampling design was implemented with or without replacement, and an FPC is used if sampling without replacement is used, even if the value of the FPC is very close to one.
Examples
In the examples that follow, we have data that represent a population, and we will discuss the analysis of these survey data as if they had been collected under five sampling plans: simple random sampling, stratified random sampling, systematic sampling, one-stage cluster sampling and two-stage cluster sampling with stratification. The Stata code necessary to generate the samples using each of these sampling plans is shown here. The variables from the data set with which we will be working include api00 and api99, which are aggregates of student test scores for each school for the years 2000 and 1999, respectively; yr_rnd, which is a 0/1 variable indicating if the school is on a year-round calendar; awards, which indicates whether or not the school met its target; meals, which indicates the percentage of children receiving free or reduced-price meals at school; both, which indicates that the school met both targets; and growth, which is the difference between the API scores in the current year and those of the previous year.
One of the most important points to remember is that all svy commands can be used with any sampling plan. To help illustrate this, we will use the svymean and the svytotal commands with each sampling plan. Another important point is that the interpretation of the results from the svy commands is usually no different than the interpretation that you would have if you had used the equivalent non-survey command. For example, there is no special interpretation of regression coefficients just because you obtained them using svyreg instead of regress.
Simple random sample
We will start by showing how you can take a simple random sample (SRS) from your data file. While we will not go through the commands necessary for obtaining any other type of sample, we will go over how to draw an SRS. Simple random samples are very rare in actual practice; however, researchers will often draw an SRS of their data set so that they can work out their data analysis programs on a relatively small data file. This saves computing time and resources, as the analysis program may have to be run many times before it is satisfactory.
set mem 5m
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
count
  6194
set seed 1003002849
sample 5
count
  310
Because we have eliminated elements of our population to create our sample, we need to create pweights (probability weights). We selected 5% of the elements in our population into our sample, so our sampling fraction is 310/6194. The pweight is the inverse of the sampling fraction, or N/n, where N is the population total (6194) and n is the number of elements selected into the sample (310). Another way to think of this is: “How many elements (schools, people, whatever) in the population should each element in the sample represent?” Each school in our current sample should represent about twenty schools in the population, so all of the pweights will be the same: 6194/310, or approximately 19.98.
gen pw = 6194/310
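One detail worth noting: gen stores new variables as float unless told otherwise, which is presumably why the population size in the output further down prints as 6193.9997 rather than 6194. The sketch below, which is not part of the original seminar, stores the weight in double precision and checks that the weights sum to the population size. The variable name pw_d is hypothetical, used so that the original pw is left untouched.

```stata
* Store the weight in double precision to limit rounding error
* (pw_d is a hypothetical name; _N is the number of sampled schools)
gen double pw_d = 6194/_N

* The weights should sum to (approximately) the population size
quietly summarize pw_d
display r(sum)
```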
Next, we need to consider how large our sample is relative to our population to determine if we need to use a finite population correction. (For a quick review of FPCs, please see the summary at the beginning of this handout.) In Stata, we only need to give the population total, and Stata will make the necessary calculations to obtain the correct FPC. Note that the svyset command is very different in Stata 8 than it was in Stata 7.
gen fpc = 6194
We use the svyset command to tell Stata about the features of the sampling design that we have. In this case, we only need to specify the pweight and the FPC.
svyset [pweight=pw], fpc(fpc)
Next, we will use the svydes command to display the information Stata has regarding our sampling plan. As you can see, the number of PSUs and observations is the same, which reassures us that Stata understands that we have a simple random sample. We also see that there is only one stratum, which is correct for this type of sampling plan. Note that once you have used the svyset command, Stata will remember this information for your entire session; you do not need to reissue this command (unless you want to change something). Also, if you save your data, Stata will save the survey information with the data set, so that when you open the data in your next session, the survey information will be used when you issue svy commands.
svydes
Strata:   <one>
PSU:      <observations>
FPC:      fpc

                                      #Obs per PSU
Strata                        ----------------------------
<one>        #PSUs      #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1
We will start our analysis of these data with some basic descriptive statistics. We will use the svymean and svytotal commands. The svymean command is used to estimate the mean of a variable in the population. In our example, we will estimate the mean for api00 and growth.
svymean api00 growth
Survey mean estimation
pweight:  pw                                      Number of obs    =       310
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       310
FPC:      fpc                                     Population size  = 6193.9997

---------------------------------------------------------------------
    Mean |   Estimate   Std. Err.    [95% Conf. Interval]        Deff
---------+-----------------------------------------------------------
   api00 |   663.2645     7.21478    649.0682    677.4608           1
  growth |   33.84516    1.667394    30.56428    37.12604           1
---------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC. Note: deft is invariant to the scale of weights.
The svytotal command is used to get estimates of population totals. In our example, we will get an estimate of how many schools are on a year-round calendar. From the output of the svytotal command, we can see that approximately 719 schools are on a year-round calendar.
svytotal yr_rnd
Survey total estimation
pweight:  pw                                      Number of obs    =       310
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       310
FPC:      fpc                                     Population size  = 6193.9997

---------------------------------------------------------------------
   Total |   Estimate   Std. Err.    [95% Conf. Interval]        Deff
---------+-----------------------------------------------------------
  yr_rnd |   719.3032    110.0291    502.8022    935.8042           1
---------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC. Note: deft is invariant to the scale of weights.
We will now do a multiple regression. We will use api00 as the dependent variable and awards and meals as independent variables. We can see from the output that the model is statistically significant (F = 464.21, p < .001), and that each of the predictors is also statistically significant. You can interpret the output from the svy commands in the same way that you would the non-svy command. In this example, you interpret the output from the svyreg command in the same way that you would the output from the regress command. Remember that the difference between the svyreg and the regress commands is how the standard errors are calculated. The svyreg command takes into account the survey sampling plan, while the regress command does not.
svyreg api00 awards meals
Survey linear regression
pweight:  pw                                      Number of obs    =       310
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       310
FPC:      fpc                                     Population size  = 6193.9997
                                                  F(   2,    308)  =    464.21
                                                  Prob > F         =    0.0000
                                                  R-squared        =    0.7124

------------------------------------------------------------------------------
       api00 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      awards |   53.37164   9.047051     5.90   0.000     35.57002    71.17325
       meals |  -3.329605   .1285952   -25.89   0.000    -3.582638   -3.076571
       _cons |   792.2924   11.52925    68.72   0.000     769.6066    814.9781
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
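The commands in this seminar use the Stata 8 syntax. In later releases (Stata 9 onward) the same analyses are run with the svy: prefix. The following sketch shows what the simple random sample analysis might look like there; it is not part of the original seminar and assumes that the sample, pw, and fpc created above are in memory.

```stata
* Declare the design; _n marks each observation as its own PSU
svyset _n [pweight = pw], fpc(fpc)
svydescribe

* Means, with design effects requested afterwards
svy: mean api00 growth
estat effects

* Total and regression
svy: total yr_rnd
svy: regress api00 awards meals
```

In this newer syntax the design effects are not printed by default, so estat effects is issued after the estimation command to request them.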
Stratified random sampling
The difference between the example above and the next example is that stratification has been added to the sampling design. For this example, we have calculated the mean of api99 and stratified schools based on this. Schools that were above the mean were placed into one stratum, and schools that were below the mean were placed in the other stratum. Simple random samples of schools were then drawn from each stratum. Although we have created only two strata, in many public-use data sets, you can have dozens of strata.
We have used the svyset, clear(all) command here to show how it is used. After issuing the svyset command, we again use the svydes command to ensure that Stata is handling the survey design appropriately. Next, we use the svymean command to obtain the estimated means of api00 and growth. We can compare these estimates to those obtained from the SRS above. (Please see the table at the end of this handout.)
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strsrs, clear
svyset, clear(all)
svyset [pweight = pw], strata(strat) fpc(fpc)
svydes
pweight:  pw
Strata:   strat
PSU:      <observations>
FPC:      fpc

                                      #Obs per PSU
Strata                        ----------------------------
strat        #PSUs      #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1
       2       310       310         1       1.0         1
--------  --------  --------  --------  --------  --------
       2       620       620         1       1.0         1
Below we use the svymean command to get the population estimate of the mean of api00. Notice the value of the design effect, labeled Deff in the output (on the far right of the table). The design effect compares the current sampling design (in this case, stratified random sampling) with simple random sampling. Design effects of 1 (or close to 1) indicate that the current sampling design is about as efficient as a simple random sample. Design effects that are smaller than 1 indicate that the current design is more efficient than simple random sampling, while design effects that are larger than 1 indicate that the current sampling design is less efficient than simple random sampling. Here, we can see the benefit of the stratification: the design effect for api00 is .35, well below 1. However, you will remember that we stratified on the mean of api99, which is closely related to api00, the variable for which we are getting an estimate.
svymean api00 growth
Survey mean estimation
pweight:  pw                                      Number of obs    =       620
Strata:   strat                                   Number of strata =         2
PSU:      <observations>                          Number of PSUs   =       620
FPC:      fpc                                     Population size  = 6193.9997

---------------------------------------------------------------------
    Mean |   Estimate   Std. Err.    [95% Conf. Interval]        Deff
---------+-----------------------------------------------------------
   api00 |   665.6216    2.957053    659.8145    671.4287    .3468751
  growth |   33.26666     1.10437    31.09788    35.43543    .9629825
---------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC. Note: deft is invariant to the scale of weights.
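The Deff column can be read as a ratio of variances: the estimated variance under the design that was declared, divided by the estimated variance that a simple random sample of the same size would have given. As a rough check that uses only the numbers printed above, the standard error implied for api00 under simple random sampling can be backed out as follows (the notation is ours, not Stata's):

```latex
\mathrm{Deff} = \frac{\widehat{V}_{\text{design}}(\hat{\theta})}{\widehat{V}_{\text{SRS}}(\hat{\theta})}
\quad\Longrightarrow\quad
\widehat{SE}_{\text{SRS}}(\text{api00}) = \frac{2.957053}{\sqrt{0.3468751}} \approx 5.02
```

In other words, stratifying on api99 cut the standard error of the api00 mean from about 5.02 to about 2.96 for the same number of schools.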
In the results of the svytotal command shown below, you will see that the design effect is not much smaller than 1; in other words, we get relatively little benefit from the stratification. That is because there is not much of a relationship between api99 and yr_rnd. The point here is that, for stratification to be genuinely useful, you need to stratify on variable(s) closely related to the variable of interest. In many cases, this will mean that while stratification will make some estimates more efficient, it will not do so for others.
svytotal yr_rnd
Survey total estimation
pweight:  pw                                      Number of obs    =       620
Strata:   strat                                   Number of strata =         2
PSU:      <observations>                          Number of PSUs   =       620
FPC:      fpc                                     Population size  = 6193.9997

---------------------------------------------------------------------
   Total |   Estimate   Std. Err.    [95% Conf. Interval]        Deff
---------+-----------------------------------------------------------
  yr_rnd |   789.5516    76.59607    639.1314    939.9717    .9457493
---------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC. Note: deft is invariant to the scale of weights.
When estimates are made for each stratum, they are made independently of all other strata. In other words, the estimate of yr_rnd for stratum 1 was made independently of the estimate for stratum 2. Also note that the sum of the estimates for stratum 1 and stratum 2 (639.7935 + 149.7581 = 789.5516) equals the value shown above.
svytotal yr_rnd, by(strat)
Survey total estimation
pweight:  pw                                      Number of obs    =       620
Strata:   strat                                   Number of strata =         2
PSU:      <observations>                          Number of PSUs   =       620
FPC:      fpc                                     Population size  = 6193.9997

----------------------------------------------------------------------------
Total   Subpop. |   Estimate   Std. Err.    [95% Conf. Interval]        Deff
----------------+-----------------------------------------------------------
yr_rnd          |
       strat==1 |   639.7935    67.69423    506.8549    772.7321    .8870259
       strat==2 |   149.7581    35.83923    79.37663    220.1395     .976068
----------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC. Note: deft is invariant to the scale of weights.
Systematic sampling
Systematic sampling is just that: drawing a sample from elements that are ordered in a systematic way. For example, you might take a systematic sample of library books by selecting every k-th book from the books on the shelf. (Remember that librarians hate when people actually do this!) Of course, first you need to determine how large a sample you want to select. There are 6194 schools in our population, and we would like to use systematic sampling to select a sample of about 500. First, we need to determine the “rate” at which schools should be selected. We do this by dividing the number of elements (e.g., schools) in the population by the number desired in the sample. Therefore, k = 6194/500 = 12.39, which we will round up to 13. Hence, we will select every 13th school. We will also randomly select a number from 1 to 13 and start counting from there. In our example, we selected the number 4. Hence, we ordered the schools from lowest id number to highest id number, started with school number 4, and then selected into our sample every 13th school. Because we rounded k up, the resulting sample contains 477 schools rather than 500. After creating our sample, we follow the same procedure as before: open the correct data file, issue the svyset command, check to see that everything is OK with the svydes command, and then begin our analysis with descriptive statistics.
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/systematic.dta, clear
svyset [pweight = pw], fpc(fpc)
svydes
pweight:  pw
Strata:   <one>
PSU:      <observations>
FPC:      fpc

                                      #Obs per PSU
Strata                        ----------------------------
<one>        #PSUs      #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       477       477         1       1.0         1
--------  --------  --------  --------  --------  --------
       1       477       477         1       1.0         1
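The selection rule described above (every 13th school, starting with school 4) could be sketched in Stata as follows. This is an illustration under stated assumptions, not the code that was used to build the systematic data file: it assumes that the school id variable in apipop is named snum, and it assumes the weight was computed as N/n.

```stata
* Load the population file and order schools by id
* (snum is assumed to be the school id variable)
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
sort snum

* Sampling interval and random start between 1 and the interval
local k = 13
local start = 4

* Keep schools 4, 17, 30, ... in the sorted file
keep if mod(_n - `start', `k') == 0
count

* Weight (N/n) and population total for the FPC
gen pw = 6194/_N
gen fpc = 6194
```

With 6194 schools in the file, this rule keeps 477 of them, which matches the number of observations reported by svydes.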
Below we get the population estimates for the mean of api00 and growth.
svymean api00 growth
Survey mean estimation
pweight:  pw                                      Number of obs    =       477
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       477
FPC:      fpc                                     Population size  =      6194

---------------------------------------------------------------------
    Mean |   Estimate   Std. Err.    [95% Conf. Interval]        Deff
---------+-----------------------------------------------------------
   api00 |   656.3061    5.655353    645.1935    667.4186           1
  growth |   33.08595    1.226588    30.67576    35.49615           1
---------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC. Note: deft is invariant to the scale of weights.
Notice that the design effect for all variables is 1. This is not because systematic sampling is always just as efficient as simple random sampling. Rather, it has to do with the information that you have given to Stata. The design effect is influenced by setting the strata and PSU. In both simple random sampling and systematic sampling, we set neither the strata nor the PSU. Hence, Stata “can’t tell the two sampling plans apart.” Because the specification of the sampling design is exactly the same as with simple random sampling, the design effect is 1. However, you can calculate the design effect by hand by dividing the variance of the variable of interest under the current sampling design by the variance of the same variable under simple random sampling. We did this and found that the design effects were fairly close to 1: .96 for api00, .93 for growth and 1.2 for yr_rnd.
svytotal yr_rnd
Survey total estimation
pweight:  pw                                      Number of obs    =       477
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       477
FPC:      fpc                                     Population size  =      6194

---------------------------------------------------------------------
   Total |   Estimate   Std. Err.    [95% Conf. Interval]        Deff
---------+-----------------------------------------------------------
  yr_rnd |   779.1195    90.44644    601.3958    956.8431           1
---------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC. Note: deft is invariant to the scale of weights.
Below we show the use of the svytab command, which makes two-way crosstabulations. Here we will make a crosstab of both and awards. The values in the cells are proportions. The svytab command also gives us the chi-square test for these two variables. We can see that the relationship between them is statistically significant.
svytab both awards
pweight:  pw                                      Number of obs    =       477
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       477
FPC:      fpc                                     Population size  =      6194

--------------------------------
met both  |  eligible for awards
targets   |     no    yes  Total
----------+---------------------
       No |  .3019      0  .3019
      Yes |  .0503  .6478  .6981
          |
    Total |  .3522  .6478      1
--------------------------------
  Key:  cell proportions

  Pearson:
    Uncorrected   chi2(1)         =  379.3900
    Design-based  F(1, 476)       =  427.4673     P = 0.0000
Finite population correction (FPC) assumes simple random sampling without replacement of PSUs within each stratum with no subsampling within PSUs.
Unlike the tabulate command, the svytab command requires two variables; in other words, it only makes two-way tables. If you want to make a one-way table, you need to create a constant variable and use it as one of the variables in the svytab command.
gen cons = 1
svytab both cons
pweight:  pw                                      Number of obs    =       477
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       477
FPC:      fpc                                     Population size  =      6194

-------------------------
met both  |     cons
targets   |      1  Total
----------+--------------
       No |  .3019  .3019
      Yes |  .6981  .6981
          |
    Total |      1      1
-------------------------
  Key:  cell proportions

Only one column category. Statistics cannot be computed.

  Pearson:
    Uncorrected   chi2(0)         =         .
    Design-based  F(., .)         =         .     P = .
Finite population correction (FPC) assumes simple random sampling without replacement of PSUs within each stratum with no subsampling within PSUs.
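In Stata 9 and later, the constant-variable workaround should no longer be needed, because svy: tabulate accepts a single variable. The following sketch is not part of the original seminar and assumes that the survey design has already been declared with svyset.

```stata
* One-way table of proportions for both
svy: tabulate both

* Two-way table of both by awards, with the design-based test
svy: tabulate both awards
```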
svyreg api00 award meals
Survey linear regression
pweight:  pw                                      Number of obs    =       477
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =       477
FPC:      fpc                                     Population size  =      6194
                                                  F(   2,    475)  =    679.67
                                                  Prob > F         =    0.0000
                                                  R-squared        =    0.6967

------------------------------------------------------------------------------
       api00 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      awards |   46.30969   7.237096     6.40   0.000     32.08908    60.53029
       meals |  -3.406531   .1056495   -32.24   0.000    -3.614128   -3.198934
       _cons |   791.0985   9.321325    84.87   0.000     772.7825    809.4146
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
One-stage cluster sampling in Stata
In a one-stage cluster sample, the data are divided into two “levels”, one “nested” in the other. At the first level, the data are grouped into clusters. In a one-stage cluster sample, clusters are selected first and are called primary sampling units, or PSUs. All of the elements in each selected cluster are selected into the sample. These elements represent the second “level” of the data. In our one-stage cluster sample, the districts will be the clusters and the schools will be the elementary or sampling units. Hence, we randomly select school districts and then select all schools within each selected district. You can use any sampling plan to select the clusters; we have used SRS only for the sake of simplicity.
Typically, data values within a cluster are more similar to one another than they are to data values in other clusters. For example, if we surveyed people in households (e.g., people nested within households), we would expect that people in one household would be more similar to one another than they would be to people in another household. Unfortunately, this similarity means that each additional observation from the same cluster adds less new information, which inflates our standard errors and makes our estimates less efficient. However, because of financial and/or logistical considerations, most surveys employ some sort of cluster sampling.
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/oscs1, clear
svyset [pweight = pw], fpc(fpc) psu(dnum)
svydes
pweight:  pw
Strata:   <one>
PSU:      dnum
FPC:      fpc

                                      #Obs per PSU
Strata                        ----------------------------
<one>        #PSUs      #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       189      1463         1       7.7       100
--------  --------  --------  --------  --------  --------
       1       189      1463         1       7.7       100
svymean api00 growth
Survey mean estimation
pweight:  pw                                      Number of obs    =      1463
Strata:   <one>                                   Number of strata =         1
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 5859.7407

---------------------------------------------------------------------
    Mean |   Estimate   Std. Err.    [95% Conf. Interval]        Deff
---------+-----------------------------------------------------------
   api00 |   670.5202    11.09702    648.6295    692.4108    15.27665
  growth |   32.85783    1.440905    30.01541    35.70025    4.554066
---------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC. Note: deft is invariant to the scale of weights.
svytotal yr_rnd
Survey total estimation
pweight:  pw                                      Number of obs    =      1463
Strata:   <one>                                   Number of strata =         1
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 5859.7407

---------------------------------------------------------------------
   Total |   Estimate   Std. Err.    [95% Conf. Interval]        Deff
---------+-----------------------------------------------------------
  yr_rnd |   797.0529    176.0585    449.7489    1144.357     14.9672
---------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC. Note: deft is invariant to the scale of weights.
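As you can see, the standard errors for api00 and yr_rnd are larger than they were under any of the previous sampling plans, even though this sample contains far more schools (1463). The standard error for growth is smaller than it was under the simple random sample, but larger than it was under the stratified and systematic samples. The large design effects (about 15 for api00 and yr_rnd) show how much precision is lost to clustering. Although we don’t show an example here, you can easily combine stratification with cluster sampling, and this will help to reduce the standard errors.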
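One common way to read a design effect is as an effective sample size, n/Deff: the size of a simple random sample that would give roughly the same precision. This is a standard rule of thumb rather than something reported in the output above; the sketch below simply applies it to the figures printed for the one-stage cluster sample.

```stata
* Effective sample size for the api00 mean (n = 1463, Deff = 15.27665)
display 1463/15.27665

* Effective sample size for the growth mean (Deff = 4.554066)
display 1463/4.554066
```

For api00 this works out to roughly 96 schools' worth of information from 1463 observations, and for growth to roughly 321.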
Two-stage cluster sampling with stratification
In this last example, we will take a stratified two-stage cluster sample. As with the stratified random sample illustrated above, the sampling for each stratum will be done independently of every other stratum. A two-stage cluster sample means that clusters will be sampled (using whatever sampling plan the researcher chooses), and then elements within each of the selected clusters will also be sampled. This is different from what we did above in that, in a one-stage cluster sample, all of the elements in each selected cluster are selected into the sample. In a two-stage cluster sample, (usually) only some of the elements are selected into the sample. In our example, we will take an SRS of school districts (clusters), and then we will take an SRS of schools (elements). In the same way that you can use pretty much any sampling plan to select clusters, you can use pretty much any sampling plan to select elements from within the selected clusters; the sampling plan for selecting the clusters does not have to be the same as the one for selecting the elements. Also, you do not have to use the same sampling plan from one stratum to the next, as the sampling between strata is independent.
To obtain the sample used below, we first used the stratification that we used before, stratifying schools by whether their api99 score was above or below the mean. Next, we randomly selected 25% of the school districts from each stratum. Finally, we randomly selected up to three schools from each selected district. The choice to select three schools, as opposed to selecting two or four schools, was rather arbitrary. When deciding how many elements to select from a cluster, remember that you need to have a sufficient number to get stable estimates; however, because data values within each cluster are likely correlated, taking lots of them is often a waste of resources: 200 elements probably won’t be much more informative than 100. (This, of course, depends on how strong the correlation is.)
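The remark about 200 elements versus 100 can be made concrete with a common approximation for equal-sized clusters, in which m is the number of elements taken per cluster and ρ is the intracluster correlation. This formula does not come from the seminar, and the value ρ = 0.1 is hypothetical, chosen only for illustration. The effective number of independent observations contributed by one cluster is then m divided by the design effect:

```latex
\mathrm{Deff} \approx 1 + (m - 1)\rho
\qquad
\left.\frac{m}{\mathrm{Deff}}\right|_{m=100} = \frac{100}{1 + 99(0.1)} \approx 9.2
\qquad
\left.\frac{m}{\mathrm{Deff}}\right|_{m=200} = \frac{200}{1 + 199(0.1)} \approx 9.6
```

Under these assumptions, doubling the number of elements taken from a cluster adds less than half an observation's worth of information, which is why sampling more clusters is usually a better use of resources than sampling more elements per cluster.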
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strataboth, clear
svyset [pweight = pwt], fpc(fpc) psu(dnum) strata(strata)
svydes
pweight:  pwt
Strata:   strata
PSU:      dnum
FPC:      fpc

                                      #Obs per PSU
 Strata                       ----------------------------
 strata      #PSUs     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1        94       227         1       2.4         3
       2        95       239         1       2.5         3
--------  --------  --------  --------  --------  --------
       2       189       466         1       2.5         3
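The commands in this seminar use the older svyset syntax, in which the PSU is given as an option. As a hedged sketch for readers on Stata 9 or later (an assumption about the reader's version, not something the seminar covers), the same design can be declared by naming the PSU variable directly after svyset:

```stata
* Newer svyset syntax (assumes Stata 9 or later): the PSU variable comes
* first, followed by the weight; strata and FPC are still options.
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strataboth, clear
svyset dnum [pweight = pwt], strata(strata) fpc(fpc)

* Describe the strata and PSUs, as svydes does above.
svydescribe
```

Once the design has been declared this way, estimation commands are run with the svy: prefix (for example, svy: mean api00 growth) rather than with the separate svymean, svytotal and svyreg commands.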
svymean api00 growth
Survey mean estimation
pweight:  pwt                                     Number of obs    =       466
Strata:   strata                                  Number of strata =         2
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 6032.9042

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
   api00 |     681.84    10.44856    661.2278    702.4522    3.902558
  growth |   30.71763     2.22572    26.32688    35.10838    3.009538
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
svytotal yr_rnd
Survey total estimation
pweight:  pwt                                     Number of obs    =       466
Strata:   strata                                  Number of strata =         2
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 6032.9042

------------------------------------------------------------------------------
   Total |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
  yr_rnd |   718.9149    214.9205    294.9345    1142.895    6.092888
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
svyreg api00 awards meals
Survey linear regression
pweight:  pwt                                     Number of obs    =       466
Strata:   strata                                  Number of strata =         2
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 6032.9042
                                                  F(   2,    186)  =    556.68
                                                  Prob > F         =    0.0000
                                                  R-squared        =    0.7114

------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      awards |   66.19885   5.867421    11.28   0.000       54.624    77.77369
       meals |  -3.192264   .1135934   -28.10   0.000    -3.416353   -2.968175
       _cons |   772.7654    6.72774   114.86   0.000     759.4934    786.0374
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
We have seen examples of how to do OLS regression with survey data, so now let’s do a logistic regression. First, we need to recode our dependent variable so that it is coded 0/1. Next, we issue the svylogit command. Note that there is no “svylogistic” command. If you want odds ratios, you can use the or option with svylogit. In this example, we use some new variables. The variable comp_imp1 is coded 0/1 and indicates whether the school met a comparable improvement target; growth is the difference between the current year’s api score and last year’s api score; ell is the percent of English language learners; and mobility is the percent of students for whom this is their first year at the school.
svylogit comp_imp1 growth ell mobility
Survey logistic regression
pweight:  pwt                                     Number of obs    =       466
Strata:   strata                                  Number of strata =         2
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 6032.9042
                                                  F(   3,    185)  =     20.80
                                                  Prob > F         =    0.0000

------------------------------------------------------------------------------
   comp_imp1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      growth |   .1213203   .0159442     7.61   0.000     .0898667    .1527739
         ell |  -.0702944   .0119777    -5.87   0.000    -.0939231   -.0466657
    mobility |  -.0781154   .0202496    -3.86   0.000    -.1180624   -.0381684
       _cons |   .6391637   .3169899     2.02   0.045      .013828    1.264499
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
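The coefficients above are on the log-odds scale. As a sketch of how the odds ratios mentioned earlier could be obtained, the block below fits the same model with the or option. It uses the svy: prefix syntax, which assumes Stata 9 or later; under the older syntax used in this seminar the equivalent would be to add the or option to svylogit.

```stata
* Assumes Stata 9 or later; same data and design as above.
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strataboth, clear
svyset dnum [pweight = pwt], strata(strata) fpc(fpc)

* The or option only changes what is displayed: exponentiated coefficients
* (odds ratios) instead of log-odds coefficients.
svy: logit comp_imp1 growth ell mobility, or

* The same conversion by hand for growth, from the stored coefficient.
display exp(_b[growth])
```

For growth, the exponentiated coefficient is roughly 1.13, meaning the odds of meeting the comparable improvement target are estimated to rise by about 13% for each one-point increase in growth, holding ell and mobility constant.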
Now we will use a three-level variable to show the use of the test command. Please note that “svytest” is an out-of-date command. As you can see, the xi prefix works with the svy commands (and so does xi3).
xi: svylogit comp_imp1 growth ell mobility i.meals3
i.meals3          _Imeals3_1-3        (naturally coded; _Imeals3_1 omitted)
Survey logistic regression
pweight:  pwt                                     Number of obs    =       466
Strata:   strata                                  Number of strata =         2
PSU:      dnum                                    Number of PSUs   =       189
FPC:      fpc                                     Population size  = 6032.9042
                                                  F(   5,    183)  =     14.28
                                                  Prob > F         =    0.0000

------------------------------------------------------------------------------
   comp_imp1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      growth |   .1333139   .0177526     7.51   0.000     .0982928     .168335
         ell |  -.0335437   .0134298    -2.50   0.013    -.0600371   -.0070503
    mobility |  -.0528434   .0194839    -2.71   0.007      -.09128   -.0144068
  _Imeals3_2 |  -1.976366   .3789415    -5.22   0.000    -2.723916   -1.228817
  _Imeals3_3 |   -2.54474   .9051281    -2.81   0.005    -4.330314   -.7591659
       _cons |   .5236906   .2685344     1.95   0.053    -.0060555    1.053437
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
test _Imeals3_2 _Imeals3_3
Adjusted Wald test
 ( 1)  _Imeals3_2 = 0
 ( 2)  _Imeals3_3 = 0

       F(  2,   186) =   15.94
            Prob > F =    0.0000
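The denominator degrees of freedom in the test above follow from the design: the design degrees of freedom are the number of PSUs minus the number of strata (189 - 2 = 187), and the adjusted Wald test of k constraints uses d - k + 1 denominator degrees of freedom (187 - 2 + 1 = 186). As a sketch for newer versions of Stata (assuming Stata 11 or later, where factor-variable notation replaces the xi prefix), the same joint test could be run like this:

```stata
* Assumes Stata 11 or later; same data and design as above.
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strataboth, clear
svyset dnum [pweight = pwt], strata(strata) fpc(fpc)

* i.meals3 creates the indicators on the fly; level 1 is the base category.
svy: logit comp_imp1 growth ell mobility i.meals3

* Joint adjusted Wald test of the two meals3 indicators.
testparm i.meals3

* Design degrees of freedom (PSUs minus strata) and the denominator
* degrees of freedom for a test of k = 2 constraints (d - k + 1).
display e(df_r)
display e(df_r) - 2 + 1
```

The same rule accounts for the other F statistics in this section: three predictors give 187 - 3 + 1 = 185 and five give 187 - 5 + 1 = 183.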
Summary of population values, estimates, standard errors, design effects and estimated population totals for each sampling plan
The table below summarizes the values obtained from the descriptive statistics that we ran under each of the sampling plans, as well as the estimated population size. It also contains the population values, which, of course, are not estimates, and hence do not have standard errors or design effects associated with them. A design effect is the ratio of the estimated variance of an estimate under the current sampling design to its estimated variance under simple random sampling. In other words, it is an estimate of the efficiency of the current sampling design relative to simple random sampling. As you can see, the standard errors and the design effects for the stratified simple random sample are the smallest, followed by those for the systematic sample and the simple random sample, both of which have design effects of 1 in the table. The design effects become much larger when cluster sampling is used, and the largest are obtained using one-stage cluster sampling. Also notice that cluster sampling yields estimates of the population size that are considerably different from those obtained using other types of sampling plans. You should not assume that this pattern of results will be obtained every time these sampling plans are compared. Some plans that look relatively inefficient in this example may appear to be more efficient with other samples and/or other data.
sampling plan | mean api00: estimate | mean api00: standard error | mean api00: design effect | mean growth: estimate | mean growth: standard error | mean growth: design effect | total yr_rnd: estimate | total yr_rnd: standard error | total yr_rnd: design effect | estimated population size |
population values | 664.71 | N/A | N/A | 32.80 | N/A | N/A | 874 | N/A | N/A | 6194 |
SRS | 663.26 | 7.21 | 1 | 33.85 | 1.67 | 1 | 719.30 | 110.03 | 1 | 6194 |
Stratified SRS | 665.62 | 2.96 | .35 | 33.27 | 1.10 | .96 | 789.55 | 76.60 | .95 | 6194 |
Systematic | 656.31 | 5.66 | 1 | 33.09 | 1.23 | 1 | 779.12 | 90.45 | 1 | 6194 |
One-stage cluster | 670.52 | 11.10 | 15.28 | 32.86 | 1.44 | 4.55 | 797.05 | 176.06 | 14.97 | 5860 |
Stratified two-stage cluster | 681.84 | 10.45 | 3.90 | 30.72 | 2.23 | 3.01 | 718.91 | 214.92 | 6.09 | 6033 |
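To see how a design effect relates to the standard errors in the table, the sketch below requests design effects for the stratified two-stage cluster sample and then repeats the calculation by hand for api00. It assumes Stata 9 or later for the svy: prefix and estat effects; in the older syntax used in this seminar, svymean prints the Deff column directly.

```stata
* Assumes Stata 9 or later; stratified two-stage cluster sample from above.
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strataboth, clear
svyset dnum [pweight = pwt], strata(strata) fpc(fpc)
svy: mean api00 growth

* DEFF is the ratio of variances; DEFT is the ratio of standard errors.
estat effects

* By hand for api00, using the values printed in the svymean output:
* dividing the design-based standard error by sqrt(deff) gives the
* standard error implied for a simple random sample of the same size.
display sqrt(3.902558)
display 10.44856 / sqrt(3.902558)
```

The last line gives roughly 5.29, so the design-based standard error of 10.45 is nearly twice what a simple random sample of 466 schools would be expected to yield for this variable.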