The purpose of this seminar is to explore how to analyze survey data collected under different sampling plans using Stata 9. Other examples, including those using other survey data analysis packages, can be found at Choosing the Correct Analysis for Various Survey Designs. Before we begin looking at examples in Stata, we will quickly review some basic issues and concepts in survey data analysis.
Why do we need survey data analysis software?
Regular statistical software (that is not designed for survey data) analyzes data as if the data were collected using simple random sampling. For experimental and quasi-experimental designs, this is exactly what we want. However, when surveys are conducted, a simple random sample is rarely collected. Not only is it nearly impossible to do so, but it is not as efficient (both financially and statistically) as other sampling methods. When any sampling method other than simple random sampling is used, we usually need to use survey data analysis software to take into account the differences between the design that was used and simple random sampling. This is because the sampling design affects the calculation of the standard errors of the estimates. If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, the standard errors will likely be underestimated, possibly leading to results that seem to be statistically significant, when in fact, they are not. The difference in point estimates and standard errors obtained using non-survey software and survey software with the design properly specified will vary from data set to data set, and even between variables within the same data set. While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be.
Sampling designs
Most people do not conduct their own surveys. Rather, they use survey data that some agency or company collected and made available to the public. The documentation must be read carefully to find out what kind of sampling design was used to collect the data. This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs. Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.
Below are some common features of many sampling designs.
Weights: There are many types of weights that can be associated with a survey. Perhaps the most common is the sampling weight, sometimes called a pweight, which is the inverse of the probability of being included in the sample due to the sampling design (except for a certainty PSU, see below). For a simple random sample, the pweight is calculated as N/n, where N = the number of elements in the population and n = the number of elements in the sample. For example, if a population has 10 elements and 3 are sampled at random with replacement, then the pweight would be 10/3 = 3.33. In a two-stage design, the pweight is calculated as f1 × f2, where f1 is the inverse of the sampling fraction for the first stage and f2 is the inverse of the sampling fraction for the second stage. Under many sampling plans, the sum of the pweights will equal the population total.
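The weight arithmetic described above can be sketched in a few lines of Python, used here only as a calculator and not as part of the Stata workflow. The function names and the two-stage sampling fractions (1 in 4 districts, 1 in 5 schools) are hypothetical and chosen for illustration.

```python
def pweight(population_size, sample_size):
    """Sampling weight for a single-stage equal-probability sample: N / n."""
    return population_size / sample_size


def two_stage_pweight(frac_stage1, frac_stage2):
    """Two-stage weight: product of the inverse sampling fractions."""
    return (1 / frac_stage1) * (1 / frac_stage2)


# Example from the text: 3 of 10 elements sampled.
w = pweight(10, 3)
print(round(w, 2))

# HYPOTHETICAL two-stage design: 1 in 4 districts, then 1 in 5 schools.
w2 = two_stage_pweight(0.25, 0.20)
print(round(w2, 2))

# With equal weights, the weights of the 3 sampled elements sum back
# to the population size of 10.
print(round(3 * w, 6))
```

In the hypothetical two-stage case each sampled school stands for about 20 schools in the population, because the two inverse fractions (4 and 5) multiply.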
PSU: This is the primary sampling unit. This is the first unit that is sampled in the design. For example, school districts from California may be sampled and then schools within districts may be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU. One does not need to use the same sampling method at all levels of sampling. For example, probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts). In the case of a simple random sample, the PSUs and the elementary units are the same.
Strata: Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race or SES. Once these groups have been defined, one samples from each group as if it were independent of all of the other groups. For example, if a sample is to be stratified on gender, men and women would be sampled independent of one another. This means that the pweights for men will likely be different from the pweights for the women. In most cases, you need to have two or more PSUs in each stratum. The purpose of stratification is to improve the precision of the estimates, and stratification works most effectively when the variance of the dependent variable is smaller within the strata than in the sample as a whole.
FPC: This is the finite population correction. This is used when the sampling fraction (the number of elements or respondents sampled relative to the population) becomes large. The FPC is used in the calculation of the standard error of the estimate. If the value of the FPC is close to 1, it will have little impact and can be safely ignored. In some survey data analysis programs, such as SUDAAN, this information will be needed if you specify that the data were collected without replacement (see below for a definition of “without replacement”). The formula for calculating the FPC is ((N-n)/(N-1))^(1/2), that is, the square root of (N-n)/(N-1), where N is the number of elements in the population and n is the number of elements in the sample. To see the impact of the FPC for samples of various proportions, suppose that you had a population of 10,000 elements.
Sample size (n)    FPC
              1    1.0000
             10     .9995
            100     .9950
            500     .9747
           1000     .9487
           5000     .7071
           9000     .3162
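The FPC values for a population of 10,000 can be reproduced with a short Python sketch (a calculator-style check, not part of the Stata workflow). The function name fpc is our own; the formula is the one given in the text.

```python
import math


def fpc(population_size, sample_size):
    """Finite population correction: sqrt((N - n) / (N - 1))."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))


N = 10_000
# Print one row per sample size, FPC rounded to four decimals.
for n in (1, 10, 100, 500, 1000, 5000, 9000):
    print(f"{n:>5}  {fpc(N, n):.4f}")
```

The correction stays close to 1 until the sample is a sizeable share of the population, which is why it is often ignored for small sampling fractions.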
Sampling with and without replacement
Most samples collected in the real world are collected “without replacement”. This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample. Many of the calculations change depending on whether a sample is collected with or without replacement. Hence, programs like SUDAAN request that you specify whether a survey sampling design was implemented with or without replacement, and an FPC is used if sampling without replacement is used, even if the value of the FPC is very close to one.
Examples
In the examples that follow, we have data that represent a population, and we will discuss the analysis of these survey data as if they had been collected under five sampling plans: simple random sampling, stratified random sampling, systematic sampling, one-stage cluster sampling and two-stage cluster sampling with stratification. The Stata code necessary to generate the samples using each of these sampling plans is shown here. The variables from the data set with which we will be working include api00 and api99, which are aggregates of student test scores for each school for the years 2000 and 1999, respectively; yr_rnd, which is a 0/1 variable indicating whether the school is on a year-round calendar; awards, which indicates whether or not the school met its target; meals, which indicates the percentage of children receiving free or reduced-price meals at school; both, which indicates that the school met both targets; and growth, which is the difference between the api scores in the current year and those of the last year.
One of the most important points to remember is that all svy commands can be used with any sampling plan. To help illustrate this, we will use the svy: mean and the svy: total commands with each sampling plan. Another important point is that the interpretation of the results from the svy commands is usually no different than the interpretation that you would have if you had used the equivalent non-survey command. For example, there is no special interpretation of regression coefficients just because you obtained them using svy: reg instead of regress.
Simple random sample
We will start by showing how you can take a simple random sample (SRS) from your data file. While we will not go through the commands necessary for obtaining any other type of sample, we will go over how to draw an SRS. Simple random samples are very rare in actual practice; however, researchers will often draw an SRS of their data set so that they can work out their data analysis programs on a relatively small data file. This saves computing time and resources, as the analysis program may have to be run many times before it is satisfactory.
set mem 5m
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
count
  6194

set seed 1003002849
sample 5
count
  310
Because we have eliminated elements of our population to create our sample, we need to create pweights (probability weights). We selected 5% of the elements in our population into our sample, so our sampling fraction is 310/6194. The pweight is the inverse of the sampling fraction, or N/n, where N is the population total (6194) and n is the number of elements selected into the sample (310). Another way to think of this is: “How many elements (schools, people, whatever) in the population should each element in the sample represent?” Each school in our current sample should represent about twenty schools in the population, so all of the pweights will be the same: approximately 20.
gen pw = 6194/310
Next, we need to consider how large our sample is relative to our population to determine if we need to use a finite population correction. (For a quick review of FPCs, please see the summary at the beginning of this handout.) In Stata, we only need to give the population total, and Stata will make the necessary calculations to obtain the correct FPC. Note that the svyset command is very different in Stata 8 than it was in Stata 7.
gen fpc = 6194
We use the svyset command to tell Stata about the features of the sampling design that we have. In this case, we only need to specify the pweight and the FPC.
svyset [pweight=pw], fpc(fpc)
      pweight: pw
          VCE: linearized
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: fpc
Next, we will use the svydes command to display the information Stata has regarding our sampling plan. As you can see, the number of PSUs and the number of observations are the same, which reassures us that Stata understands that we have a simple random sample. We also see that there is only one stratum, which is correct for this type of sampling plan. Note that once you have used the svyset command, Stata will remember this information for your entire session; you do not need to reissue this command (unless you want to change something). Also, if you save your data, Stata will save the survey information with the data set, so that when you open the data in your next session, the survey information will be used when you issue svy commands.
svydes

Survey: Describing stage 1 sampling units

      pweight: pw
          VCE: linearized
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: fpc

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1
We will start our analysis of these data with some basic descriptive statistics. We will use the svy: mean and svy: total commands. The svy: mean command is used to estimate the mean of a variable in the population. In our example, we will estimate the mean for api00 and growth. Please note that svy: mean is an estimation command, and Stata will do a listwise deletion of missing data. For example, if we had missing data on api00, the command svy: mean api00 growth would probably give a different mean for growth than the command svy: mean growth, because a different number of cases would be used in the calculation of the mean.
svy: mean api00 growth
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       1          Number of obs    =     310
Number of PSUs   =     310          Population size  =    6194
                                    Design df        =     309

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       api00 |   663.2645    7.21478      649.0682    677.4608
      growth |   33.84516   1.667394      30.56428    37.12604
--------------------------------------------------------------
The svy: total command is used to get estimates of population totals. In our example, we will get an estimate of how many schools are on a year-round calendar. From the output of the svy: total command, we can see that approximately 719 schools are on a year-round calendar.
svy: total yr_rnd
(running total on estimation sample)

Survey: Total estimation

Number of strata =       1          Number of obs    =     310
Number of PSUs   =     310          Population size  =    6194
                                    Design df        =     309

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      yr_rnd |   719.3032   110.0291      502.8022    935.8042
--------------------------------------------------------------
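To see where the population size and the estimated total come from, the following Python sketch (a calculator-style check, not Stata code) rebuilds both from the weight. The count of 36 year-round schools in the sample is an assumption inferred from the reported total; it is not stated in the seminar.

```python
N = 6194          # schools in the population (from the count output)
n = 310           # schools in the 5% sample
pw = N / n        # same value as "gen pw = 6194/310"

# The weights sum to the population size reported by Stata.
sum_of_weights = n * pw

# ASSUMPTION: 36 year-round schools in the sample. This count is inferred
# from the reported total; it is not stated in the seminar.
year_round_in_sample = 36
estimated_total = pw * year_round_in_sample

print(round(pw, 2), round(sum_of_weights), round(estimated_total, 4))
```

Under that assumption, the weighted count reproduces the total of roughly 719 year-round schools shown in the output: each sampled year-round school stands for about 20 schools in the population.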
We will now do a multiple regression. We will use api00 as the dependent variable and awards and meals as independent variables. We can see from the output that the model is statistically significant (F = 464.21, p < .001), and that each of the predictors is also statistically significant. You can interpret the output from the svy commands in the same way that you would the non-svy command. In this example, you interpret the output from the svy: reg command in the same way that you would the output from the regress command. Remember that the difference between the svy: reg and the regress commands is how the standard errors are calculated. The svy: reg command takes into account the survey sampling plan, while the regress command does not.
svy: reg api00 awards meals
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         1                  Number of obs      =       310
Number of PSUs     =       310                  Population size    = 6193.9997
                                                Design df          =       309
                                                F(   2,    308)    =    464.21
                                                Prob > F           =    0.0000
                                                R-squared          =    0.7124

------------------------------------------------------------------------------
             |             Linearized
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      awards |   53.37164   9.047051     5.90   0.000     35.57002    71.17326
       meals |  -3.329605   .1285952   -25.89   0.000    -3.582638   -3.076571
       _cons |   738.9207   19.47419    37.94   0.000     700.6019    777.2395
------------------------------------------------------------------------------
Stratified random sampling
The difference between the example above and the next example is that stratification has been added to the sampling design. For this example, we have calculated the mean of api99 and stratified schools based on this. Schools that were above the mean were placed into one stratum, and schools that were below the mean were placed in the other stratum. Simple random samples of schools were then drawn from each stratum. Although we have created only two strata, in many public-use data sets, you can have dozens of strata.
We have used the svyset, clear(all) command here to show how previous survey settings are cleared. After issuing the svyset command, we again use the svydes command to ensure that Stata is handling the survey design appropriately. Next, we use the svy: mean command to obtain the estimated means of api00 and growth. We can compare these estimates to those obtained from the SRS above. (Please see the table at the end of this handout.)
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strsrs, clear
svyset, clear(all)
svyset [pweight = pw], strata(strat) fpc(fpc)

      pweight: pw
          VCE: linearized
     Strata 1: strat
         SU 1: <observations>
        FPC 1: fpc

svydes

Survey: Describing stage 1 sampling units

      pweight: pw
          VCE: linearized
     Strata 1: strat
         SU 1: <observations>
        FPC 1: fpc

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1
       2       310       310         1       1.0         1
--------  --------  --------  --------  --------  --------
       2       620       620         1       1.0         1
Below we use the svy: mean command to get the population estimate of the mean of api00. We can use the estat effects command to get the design effect. Notice the value of the design effect, labeled Deff in the output. The design effect compares the current sampling design (in this case, stratified random sampling) with simple random sampling. Design effects of 1 (or close to 1) indicate that the current sampling design is about as efficient as a simple random sample. Design effects that are smaller than 1 indicate that the current design is more efficient than simple random sampling, while design effects that are larger than 1 indicate that the current sampling design is less efficient than simple random sampling. Here, we can see the benefit of the stratification: the design effect for api00 is .35, well below 1. However, you will remember that we stratified on the mean of api99, which is closely related to api00, the variable for which we are getting an estimate.
svy: mean api00 growth
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2          Number of obs    =     620
Number of PSUs   =     620          Population size  =    6194
                                    Design df        =     618

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       api00 |   665.6216   2.957053      659.8145    671.4287
      growth |   33.26666    1.10437      31.09788    35.43543
--------------------------------------------------------------

estat effects

----------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.       Deff      Deft
-------------+--------------------------------------------
       api00 |   665.6216   2.957053     .346875   .558707
      growth |   33.26666    1.10437     .962983   .930909
----------------------------------------------------------
Note: Weights must represent population totals for deff to
      be correct when using an FPC; however, deft is
      invariant to the scale of weights.
In the results of the svy: total shown below, you will see that the design effect is not much smaller than 1; in other words, we get relatively little benefit from the stratification. That is because there is not much of a relationship between api99 and yr_rnd. The point here is that, for stratification to be genuinely useful, you need to stratify on variable(s) closely related to the variable of interest. In many cases, this will mean that while stratification will make some estimates more efficient, it will not do so for others.
svy: total yr_rnd
(running total on estimation sample)

Survey: Total estimation

Number of strata =       2          Number of obs    =     620
Number of PSUs   =     620          Population size  =    6194
                                    Design df        =     618

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      yr_rnd |   789.5516   76.59607      639.1314    939.9717
--------------------------------------------------------------
When estimates are made for each stratum, they are made independently of all other strata. In other words, the estimate of yr_rnd for stratum 1 was made independently of the estimate for stratum 2. Also note that the sum of the estimates for stratum 1 and stratum 2 equals the value shown above.
svy: total yr_rnd, over(strat)
(running total on estimation sample)

Survey: Total estimation

Number of strata =       2          Number of obs    =     620
Number of PSUs   =     620          Population size  =    6194
                                    Design df        =     618

            1: strat = 1
            2: strat = 2

--------------------------------------------------------------
             |             Linearized
        Over |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
yr_rnd       |
           1 |   639.7935   67.69423      506.8549    772.7321
           2 |   149.7581   35.83923      79.37663    220.1395
--------------------------------------------------------------
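The independence of the strata can be checked numerically from the reported figures. The Python sketch below (a calculator-style check, not Stata code) adds the stratum totals and combines the stratum standard errors by summing their variances, which is what independence implies.

```python
import math

# Stratum-level estimates copied from the over(strat) output.
totals = [639.7935, 149.7581]
std_errs = [67.69423, 35.83923]

overall_total = sum(totals)
# Independent strata: variances add, so the overall standard error is
# the square root of the sum of the squared stratum standard errors.
overall_se = math.sqrt(sum(se ** 2 for se in std_errs))

print(round(overall_total, 4), round(overall_se, 3))
```

Both results agree, up to rounding, with the overall svy: total yr_rnd estimate (789.5516) and its standard error (76.59607).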
Systematic sampling
Systematic sampling is just that: drawing a sample from elements that are ordered in a systematic way. For example, you might take a systematic sample of library books by selecting every k-th book from the books on the shelf. (Remember that librarians hate when people actually do this!) Of course, first you need to determine how large a sample you want to select. There are 6194 schools in our population, and we would like to use systematic sampling to select a sample of about 500. First, we need to determine the “rate” at which schools should be selected. We do this by dividing the number of elements (e.g., schools) in the population by the number desired in the sample. Therefore, k = 6194/500 = 12.39, which we will round up to 13. Hence, we will select every 13th school. We will also randomly select a number from 1 to 13 and start counting from there. In our example, we selected the number 4. Hence, we ordered the schools from lowest id number to highest id number, started with school number 4, and then selected into our sample every 13th school. Because k was rounded up, the resulting sample contains 477 schools rather than 500. After creating our sample, we follow the same procedure as before: open the correct data file, issue the svyset command, check to see that everything is OK with the svydes command, and then begin our analysis with descriptive statistics.
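The selection rule just described can be sketched in Python (an illustration of the arithmetic only; the seminar's samples were generated in Stata). The sketch works with positions 1 to 6194 in the id-sorted list rather than with real school ids.

```python
import math

N = 6194            # schools in the population
target_n = 500      # desired sample size

k = math.ceil(N / target_n)   # 6194 / 500 = 12.388, rounded up to 13
start = 4                     # random start used in the text (between 1 and k)

# Positions (1-based) in the id-sorted list that fall into the sample.
selected = list(range(start, N + 1, k))

print(k, len(selected), selected[:3], selected[-1])
```

Rounding the interval up is why the sample ends up with 477 schools, matching the number of observations in the svydes output, rather than the 500 originally targeted.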
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/systematic.dta, clear
svyset [pweight = pw], fpc(fpc)

      pweight: pw
          VCE: linearized
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: fpc

svydes

Survey: Describing stage 1 sampling units

      pweight: pw
          VCE: linearized
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: fpc

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max
--------  --------  --------  --------  --------  --------
       1       477       477         1       1.0         1
--------  --------  --------  --------  --------  --------
       1       477       477         1       1.0         1
Below we get the population estimates for the mean of api00 and growth.
svy: mean api00 growth
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       1          Number of obs    =     477
Number of PSUs   =     477          Population size  =    6194
                                    Design df        =     476

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       api00 |   656.3061   5.655353      645.1935    667.4186
      growth |   33.08595   1.226588      30.67576    35.49615
--------------------------------------------------------------

estat effects

----------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.       Deff      Deft
-------------+--------------------------------------------
       api00 |   656.3061   5.655353           1   .960724
      growth |   33.08595   1.226588           1   .960724
----------------------------------------------------------
Note: Weights must represent population totals for deff to
      be correct when using an FPC; however, deft is
      invariant to the scale of weights.
Notice that the design effect (Deff) for all variables is 1. This is not necessarily because systematic sampling is always just as efficient as simple random sampling. Rather, it has to do with the information that you have given to Stata. The design effect is influenced by setting the strata and PSU. In both simple random sampling and systematic sampling, we set neither the strata nor the PSU. Hence, Stata “can’t tell the two sampling plans apart.” Because the specification of the sampling design is exactly the same as with simple random sampling, the design effect is 1. However, you can calculate the design effect by hand by dividing the variance of the variable of interest under the current sampling design by the variance of the same variable under simple random sampling. We did this and found that the design effects were very close to 1. We found them to be .96 for api00, .93 for growth and 1.2 for yr_rnd.
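One detail of the estat effects output may look puzzling: Deff is 1 while Deft is .960724. The reported numbers are consistent with Deft being measured against simple random sampling with replacement, so that Deft squared equals Deff times (1 - n/N) when an FPC has been set. The Python sketch below checks this against the figures in the seminar output; it is an observation about these numbers and an assumption about the definition, not a statement of Stata's internal formula. The function name is our own.

```python
import math


def deft_from_deff(deff, n, N):
    """Deft implied by Deff if Deft ignores the FPC: sqrt(Deff * (1 - n/N))."""
    return math.sqrt(deff * (1 - n / N))


# Systematic sample: Deff = 1, n = 477, N = 6194.
print(round(deft_from_deff(1.0, 477, 6194), 6))

# Stratified sample, api00: Deff = .346875, n = 620, N = 6194.
print(round(deft_from_deff(0.346875, 620, 6194), 6))
```

The two results can be compared with the Deft column for the systematic sample (.960724) and for api00 in the stratified sample (.558707).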
svy: total yr_rnd
(running total on estimation sample)

Survey: Total estimation

Number of strata =       1          Number of obs    =     477
Number of PSUs   =     477          Population size  =    6194
                                    Design df        =     476

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      yr_rnd |   779.1195   90.44644      601.3958    956.8432
--------------------------------------------------------------
Below we show the use of the svy: tab command. This can be used to make one- and two-way crosstabulations. Here we will make a crosstab of both and awards. The values in the cells are proportions. You can use the count option (as shown below) to obtain the counts in each cell. The svy: tab command also gives us the chi-square test for these two variables. We can see that the relationship between them is statistically significant.
svy: tab both awards
(running tabulate on estimation sample)

Number of strata   =         1                  Number of obs      =       477
Number of PSUs     =       477                  Population size    =      6194
                                                Design df          =       476

-------------------------------
met both  | eligible for awards
targets   |    no    yes  Total
----------+--------------------
       No | .3019      0  .3019
      Yes | .0503  .6478  .6981
          |
    Total | .3522  .6478      1
-------------------------------
  Key:  cell proportions

  Pearson:
    Uncorrected   chi2(1)         =  379.3900
    Design-based  F(1, 476)       =  427.4673     P = 0.0000

svy: tab both awards, count
(running tabulate on estimation sample)

Number of strata   =         1                  Number of obs      =       477
Number of PSUs     =       477                  Population size    =      6194
                                                Design df          =       476

-------------------------------
met both  | eligible for awards
targets   |    no    yes  Total
----------+--------------------
       No |  1870      0   1870
      Yes | 311.6   4012   4324
          |
    Total |  2182   4012   6194
-------------------------------
  Key:  weighted counts

  Pearson:
    Uncorrected   chi2(1)         =  379.3900
    Design-based  F(1, 476)       =  427.4673     P = 0.0000

svy: reg api00 award meals
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         1                  Number of obs      =       477
Number of PSUs     =       477                  Population size    =      6194
                                                Design df          =       476
                                                F(   2,    475)    =    679.67
                                                Prob > F           =    0.0000
                                                R-squared          =    0.6967

------------------------------------------------------------------------------
             |             Linearized
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      awards |   46.30969   7.237096     6.40   0.000     32.08908    60.53029
       meals |  -3.406531   .1056495   -32.24   0.000    -3.614128   -3.198934
       _cons |   791.0985   9.321325    84.87   0.000     772.7825    809.4146
------------------------------------------------------------------------------
One-stage cluster sampling in Stata
In a one-stage cluster sample, the data are divided into two “levels”, one “nested” in the other. At the first level, the data are grouped into clusters. In a one-stage cluster sample, clusters are selected first and are called primary sampling units, or PSUs. All of the elements in each selected cluster are selected into the sample. These elements represent the second “level” of the data. In our one-stage cluster sample, the districts will be the clusters and the schools will be the elementary or sampling units. Hence, we randomly select school districts and then select all schools within each selected district. You can use any sampling plan to select the clusters; we have used SRS only for the sake of simplicity.
Typically, data values in one cluster are more similar to one another than they are to data values in another cluster. For example, if we surveyed people in households (e.g., people nested within households), we would expect that people in one household would be more similar to one another than they would be to people in another household. Unfortunately, this similarity makes our estimates less efficient: each additional element from the same cluster adds less new information, so the standard errors are larger than they would be for a simple random sample of the same size. However, because of financial and/or logistical considerations, most surveys employ some sort of cluster sampling.
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/oscs1, clear
svyset dnum [pweight = pw], fpc(fpc)

      pweight: pw
          VCE: linearized
     Strata 1: <one>
         SU 1: dnum
        FPC 1: fpc

svydes

Survey: Describing stage 1 sampling units

      pweight: pw
          VCE: linearized
     Strata 1: <one>
         SU 1: dnum
        FPC 1: fpc

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max
--------  --------  --------  --------  --------  --------
       1       189      1463         1       7.7       100
--------  --------  --------  --------  --------  --------
       1       189      1463         1       7.7       100

svy: mean api00 growth
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       1          Number of obs    =    1463
Number of PSUs   =     189          Population size  = 5859.74
                                    Design df        =     188

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       api00 |   670.5202   11.09702      648.6295    692.4108
      growth |   32.85783   1.440905      30.01541    35.70025
--------------------------------------------------------------

svy: total yr_rnd
(running total on estimation sample)

Survey: Total estimation

Number of strata =       1          Number of obs    =    1463
Number of PSUs   =     189          Population size  = 5859.74
                                    Design df        =     188

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      yr_rnd |   797.0529   176.0585      449.7489    1144.357
--------------------------------------------------------------
As you can see, the standard errors for api00 and yr_rnd are much larger than they were under any of the previous sampling plans, even though this sample contains many more schools (1463). Although we don’t show an example here, you can easily combine stratification with cluster sampling, and this will help to reduce the standard errors.
Two-stage cluster sampling with stratification
In this last example, we will take a stratified two-stage cluster sample. As with the stratified random sample illustrated above, the sampling for each stratum will be done independently of every other stratum. A two-stage cluster sample means that clusters will be sampled (using whatever sampling plan the researcher chooses), and then elements within each of the selected clusters will also be sampled. This is different from what we did above in that, in a one-stage cluster sample, all of the elements in each selected cluster are selected into the sample. In a two-stage cluster sample, (usually) only some of the elements are selected into the sample. In our example, we will take an SRS of school districts (clusters), and then we will take an SRS of schools (elements). In the same way that you can use pretty much any sampling plan to select clusters, you can use pretty much any sampling plan to select elements from within the selected clusters; the sampling plan for selecting the clusters does not have to be the same as the one for selecting the elements. Also, you do not have to use the same sampling plan from one stratum to the next, as the sampling between strata is independent.
To obtain the sample used below, we first used the stratification that we used before, stratifying schools by whether their api99 score was above or below the mean. Next, we randomly selected 25% of the school districts from each stratum. Finally, we randomly selected three schools from each selected district. The choice to select three schools, as opposed to selecting two or four schools, was rather arbitrary. When deciding how many elements to select from a cluster, remember that you need to have a sufficient number to get stable estimates; on the other hand, because data values within each cluster are likely correlated, taking lots of them is often a waste of resources: 200 elements probably won’t be any more informative than 100. (This, of course, depends on how strong the correlation is.)
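The diminishing return from sampling more elements per cluster can be illustrated with a common textbook approximation for the design effect of a cluster sample, 1 + (m - 1) × rho, where m is the number of elements taken from a cluster and rho is the intraclass correlation. Neither this formula nor the value rho = 0.2 comes from the seminar; both are assumptions used for a rough Python sketch.

```python
def effective_sample_size(m, rho):
    """Approximate number of independent observations that m correlated
    elements from one cluster are worth: m / (1 + (m - 1) * rho)."""
    return m / (1 + (m - 1) * rho)


rho = 0.2  # ASSUMPTION: hypothetical intraclass correlation
for m in (3, 100, 200):
    print(m, round(effective_sample_size(m, rho), 2))
```

Under these assumptions, going from 100 to 200 elements in a cluster adds less than a tenth of an independent observation, which is the sense in which 200 elements are hardly more informative than 100. With a weaker correlation the gain would be larger.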
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strataboth, clear
svyset dnum [pweight = pwt], fpc(fpc) strata(strata)

      pweight: pwt
          VCE: linearized
     Strata 1: strata
         SU 1: dnum
        FPC 1: fpc

svydes

Survey: Describing stage 1 sampling units

      pweight: pwt
          VCE: linearized
     Strata 1: strata
         SU 1: dnum
        FPC 1: fpc

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max
--------  --------  --------  --------  --------  --------
       1        94       227         1       2.4         3
       2        95       239         1       2.5         3
--------  --------  --------  --------  --------  --------
       2       189       466         1       2.5         3

svy: mean api00 growth
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2          Number of obs    =     466
Number of PSUs   =     189          Population size  =  6032.9
                                    Design df        =     187

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       api00 |     681.84   10.44856      661.2278    702.4522
      growth |   30.71763    2.22572      26.32688    35.10838
--------------------------------------------------------------

svy: total yr_rnd
(running total on estimation sample)

Survey: Total estimation

Number of strata =       2          Number of obs    =     466
Number of PSUs   =     189          Population size  =  6032.9
                                    Design df        =     187

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      yr_rnd |   718.9149   214.9205      294.9345    1142.895
--------------------------------------------------------------

svy: reg api00 awards meals
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         2                  Number of obs      =       466
Number of PSUs     =       189                  Population size    = 6032.9042
                                                Design df          =       187
                                                F(   2,    186)    =    556.68
                                                Prob > F           =    0.0000
                                                R-squared          =    0.7114

------------------------------------------------------------------------------
             |             Linearized
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      awards |   66.19885   5.867421    11.28   0.000       54.624    77.77369
       meals |  -3.192264   .1135934   -28.10   0.000    -3.416353   -2.968175
       _cons |   772.7654    6.72774   114.86   0.000     759.4934    786.0374
------------------------------------------------------------------------------
We have seen examples of how to do OLS regression with survey data, so now let’s do a logistic regression. First, we need to recode our dependent variable so that it is coded 0/1. Next, we issue the svy: logit command. If you want odds ratios, you can use the or option with svy: logit. In this example, we use some new variables. The variable comp_imp1 is coded 0/1 and indicates whether the school met a comparable improvement target; growth is the difference between the current year’s api score and last year’s api score; ell is the percent of English language learners; and mobility is the percent of students for whom this is their first year at the school.
svy: logit comp_imp1 growth ell mobility
(running logit on estimation sample)

Survey: Logistic regression

Number of strata   =         2                  Number of obs      =       466
Number of PSUs     =       189                  Population size    = 6032.9042
                                                Design df          =       187
                                                F(   3,    185)    =     20.80
                                                Prob > F           =    0.0000

------------------------------------------------------------------------------
             |             Linearized
   comp_imp1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      growth |   .1213203   .0159442     7.61   0.000     .0898667    .1527739
         ell |  -.0702944   .0119777    -5.87   0.000    -.0939231   -.0466657
    mobility |  -.0781154   .0202496    -3.86   0.000    -.1180624   -.0381684
       _cons |   .6391637   .3169899     2.02   0.045      .013828    1.264499
------------------------------------------------------------------------------
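The or option mentioned above can be added to the same model to report odds ratios rather than log-odds coefficients. The following is a minimal sketch that reuses only the data set, design declaration and variables already shown in this example; the display line is simply an illustrative hand check that exponentiates the growth coefficient reported above.

```stata
* Load the stratified two-stage cluster sample and declare the design,
* exactly as earlier in this example.
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strataboth, clear
svyset dnum [pweight = pwt], fpc(fpc) strata(strata)

* Same logistic regression as above, but reporting odds ratios.
svy: logit comp_imp1 growth ell mobility, or

* Hand check: an odds ratio is the exponentiated coefficient.
* .1213203 is the growth coefficient from the output above.
display exp(.1213203)
```

Only the way the coefficients are displayed changes; the t statistics and p-values are the same as in the output above.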
Now we will use a three-level variable, meals3, to show the use of the test command. Please note that “svytest” is an out-of-date command. As the command below shows, the xi prefix works with the svy commands (and so does xi3). However, you need to use the prefixes in the correct order: “xi: svy: logit” works, but “svy: xi: logit” does not.
xi: svy: logit comp_imp1 growth ell mobility i.meals3
i.meals3          _Imeals3_1-3        (naturally coded; _Imeals3_1 omitted)
(running logit on estimation sample)

Survey: Logistic regression

Number of strata   =         2                  Number of obs      =       466
Number of PSUs     =       189                  Population size    = 6032.9042
                                                Design df          =       187
                                                F(   5,    183)    =     14.28
                                                Prob > F           =    0.0000

------------------------------------------------------------------------------
             |             Linearized
   comp_imp1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      growth |   .1333139   .0177526     7.51   0.000     .0982928     .168335
         ell |  -.0335437   .0134298    -2.50   0.013    -.0600371   -.0070503
    mobility |  -.0528434   .0194839    -2.71   0.007      -.09128   -.0144068
  _Imeals3_2 |  -1.976366   .3789415    -5.22   0.000    -2.723916   -1.228817
  _Imeals3_3 |   -2.54474   .9051281    -2.81   0.005    -4.330314   -.7591659
       _cons |   .5236906   .2685344     1.95   0.053    -.0060555    1.053437
------------------------------------------------------------------------------

test _Imeals3_2 _Imeals3_3

Adjusted Wald test

 ( 1)  _Imeals3_2 = 0
 ( 2)  _Imeals3_3 = 0

       F(  2,   186) =   15.94
            Prob > F =    0.0000
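The degrees of freedom in the adjusted Wald test follow from the design degrees of freedom shown in the output. As a sketch, with $W$ denoting the usual Wald chi-square statistic, $k$ the number of coefficients tested and $d$ the design degrees of freedom (symbols introduced here for illustration), the adjusted statistic and its reference distribution can be written as:

```latex
% d = (number of PSUs) - (number of strata) = 189 - 2 = 187  (the "Design df" in the output)
% k = number of coefficients tested jointly
\[
  F \;=\; \frac{d - k + 1}{k\,d}\, W \;\sim\; F(k,\; d - k + 1).
\]
% For the test of _Imeals3_2 and _Imeals3_3: k = 2, so the denominator df is
\[
  d - k + 1 \;=\; 187 - 2 + 1 \;=\; 186,
\]
% which matches the F(2, 186) reported by the test command.
```

The same arithmetic accounts for the model F statistics reported earlier: F(3, 185) for the three-predictor logit model and F(5, 183) for the model with five coefficients.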
Summary of population values, estimates, standard errors, design effects and estimated population totals for each sampling plan
The table below summarizes the values obtained from the descriptive statistics that we ran under each of the sampling plans, as well as the estimated population size. It also contains the population values, which, of course, are not estimates, and hence do not have standard errors or design effects associated with them. (To obtain the design effects, you will need to issue the estat effects command after the analysis command.) A design effect is the ratio of the estimated variance of an estimate under the current sampling design to the variance that would be expected under simple random sampling with the same number of observations. In other words, it is an estimate of the efficiency of the current sampling design relative to simple random sampling: values above 1 indicate a less efficient design, and values below 1 a more efficient one. As you can see, the stratified simple random sample has the smallest standard errors and design effects. The simple random sample and the systematic sample each have a design effect of 1, and the design effects become much larger when cluster sampling is used. The largest design effects are obtained using one-stage cluster sampling. Also notice that cluster sampling yields estimates of the population size that differ considerably from those obtained using the other sampling plans. You should not assume that this pattern of results will be obtained every time these sampling plans are compared. Some plans that look relatively inefficient in this example may appear to be more efficient with other samples and/or other data.
| Sampling plan | Mean api00: estimate | Mean api00: standard error | Mean api00: design effect | Mean growth: estimate | Mean growth: standard error | Mean growth: design effect | Total yr_rnd: estimate | Total yr_rnd: standard error | Total yr_rnd: design effect | Estimated population size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Population values | 664.71 | N/A | N/A | 32.80 | N/A | N/A | 874 | N/A | N/A | 6194 |
| SRS | 663.26 | 7.21 | 1 | 33.85 | 1.67 | 1 | 719.30 | 110.03 | 1 | 6194 |
| Stratified SRS | 665.62 | 2.96 | .35 | 33.27 | 1.10 | .96 | 789.55 | 76.60 | .95 | 6194 |
| Systematic | 656.31 | 5.66 | 1 | 33.09 | 1.23 | 1 | 779.12 | 90.45 | 1 | 6194 |
| One-stage cluster | 670.52 | 11.10 | 15.28 | 32.86 | 1.44 | 4.55 | 797.05 | 176.06 | 14.97 | 5860 |
| Stratified two-stage cluster | 681.84 | 10.45 | 3.90 | 30.72 | 2.23 | 3.01 | 718.91 | 214.92 | 6.09 | 6033 |
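As noted above, design effects are obtained by issuing estat effects after the analysis command. The following is a minimal sketch for the stratified two-stage cluster sample, using only the data set, design declaration and commands that appear in this seminar; the design effects it reports should correspond to the entries given for that design in the table.

```stata
* Load the stratified two-stage cluster sample and declare its design,
* exactly as in the example earlier in this seminar.
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strataboth, clear
svyset dnum [pweight = pwt], fpc(fpc) strata(strata)

* Design effects for the two estimated means.
svy: mean api00 growth
estat effects

* Design effects for the estimated total.
svy: total yr_rnd
estat effects
```

The same two-step pattern (analysis command, then estat effects) applies to the other sampling plans once the corresponding data set has been loaded and svyset.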