Statistical Computing Seminars: Introduction to Survey Data Analysis
The purpose of this seminar is to introduce you to the use of Stata, SUDAAN, WesVar and SAS for the analysis of survey data. It will draw much of its materials and examples from Choosing the Correct Analysis for Various Survey Designs.
Why do we need survey data analysis software?
Regular statistical software (that is not designed for survey data) analyzes data as if the data were collected using simple random sampling. For experimental and quasi-experimental designs, this is exactly what we want. However, when surveys are conducted, a simple random sample is rarely collected. Not only is it nearly impossible to do so, but it is not as efficient (both financially and statistically) as other sampling methods. When any sampling method other than simple random sampling is used, we need to use survey data analysis software to take into account the differences between the design that was used and simple random sampling. The sampling design affects the calculation of the standard errors of the estimates. If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, the standard errors will likely be underestimated, possibly leading to results that seem to be statistically significant, when in fact, they are not. The difference in point estimates and standard errors obtained using non-survey software and survey software with the design properly specified will vary from data set to data set, and even between variables within the same data set. While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be.
Sampling designs
Most people do not conduct their own surveys. Rather, they use survey data that some agency or company collected and made available to the public. The documentation must be read carefully to find out what kind of sampling design was used to collect the data. This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs. Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.
Below are some common features of many sampling designs.
Weights: There are many types of weights that can be associated with a survey. Perhaps the most common is the probability weight, which is the inverse of the probability of being included in the sample due to the sampling design (except for a certainty PSU, see below). Under simple random sampling, the probability weight is calculated as N/n, where N = the number of elements in the population and n = the number of elements in the sample. For example, if a population has 10 elements and 3 are sampled at random with replacement, then the probability weight would be 10/3 = 3.33. The sum of the probability weights should equal the population size, N. The probability weight may be adjusted for several things, such as errors in the sampling frame and unit non-response, and it may be raked to known population totals. Once these adjustments have been made, it is called a sampling weight.
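The arithmetic in the example above can be checked with a few lines of Python. This is only a sketch of the simple-random-sampling case; the variable names are ours, not part of any survey package.

```python
# Probability weights under simple random sampling: w = N / n for every sampled element.
N = 10  # population size (from the example in the text)
n = 3   # sample size (from the example in the text)

weight = N / n
weights = [weight] * n  # every sampled element carries the same weight

print(f"probability weight = {weight:.2f}")
print(f"sum of weights     = {sum(weights):.2f}")  # should recover N
```

Each sampled element "stands in" for N/n elements of the population, which is why the weights add back up to the population size.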
PSU: This is the primary sampling unit. This is the first unit that is sampled in the design. For example, school districts from California may be sampled and then schools within districts may be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU. One does not need to use the same sampling method at all levels of sampling. For example, probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts). In the case of a simple random sample, the PSUs and the elementary units are the same.
Strata: Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race or SES. Once these groups have been defined, one samples from each group as if it were independent of all of the other groups. For example, if a sample is to be stratified on gender, men and women would be sampled independently of one another. This means that the sampling weights for men will likely be different from the sampling weights for women. In most cases, you need to have two or more PSUs in each stratum. The purpose of stratification is to improve the precision of the estimates.
FPC: This is the finite population correction. This is used when the sampling fraction (the number of elements or respondents sampled relative to the population) becomes large. The FPC is used in the calculation of the standard error of the estimate. If the value of the FPC is close to 1, it will have little impact and can be safely ignored. In some survey data analysis programs, such as SUDAAN, this information will be needed if you specify that the data were collected without replacement (see below for a definition of “without replacement”). To see the impact of the FPC for samples of various proportions, suppose that you had a population of 10,000 elements.
Sample size (n)    FPC
       1           1.0
      10           .9995
     100           .9950
     500           .9747
    1000           .9487
    5000           .7071
    9000           .3162
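The seminar does not state the formula behind these values. The sketch below assumes the common form FPC = sqrt((N - n)/(N - 1)), which reproduces the table; the simpler form sqrt(1 - n/N) gives the same numbers to four decimal places for this population size. The function name `fpc` is ours.

```python
import math

def fpc(N, n):
    """Finite population correction factor, assuming sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))

N = 10_000  # population size used in the table
sample_sizes = [1, 10, 100, 500, 1000, 5000, 9000]

# Round to four decimals, as in the table.
fpc_table = {n: round(fpc(N, n), 4) for n in sample_sizes}

for n, value in fpc_table.items():
    print(f"{n:>6}  {value:.4f}")
```

The factor stays above .99 until the sample reaches about 1% of the population, which is why it is often ignored for small sampling fractions.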
Imputation flag: This is a 0/1 variable that is associated with a variable in the data set and indicates whether the corresponding value in the associated variable was imputed or given by the respondent. For example, in the data set below
Subject   Response   ImputeFlag
   1         60          0
   2         60          1
   3         63          0
the value for subject number 2 was imputed. The flag does not tell you how the imputation was done (e.g., by mean substitution or multiple imputation). These variables are useful for determining how much missing data each variable has.
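As a small illustration of that last point, the flag can be tallied to see how much of a variable was imputed. The rows are the three shown above; the variable names are ours.

```python
# Rows copied from the small data set above: (subject, response, impute_flag)
rows = [(1, 60, 0), (2, 60, 1), (3, 63, 0)]

n_imputed = sum(flag for _, _, flag in rows)
share_imputed = n_imputed / len(rows)
imputed_subjects = [subject for subject, _, flag in rows if flag == 1]

print(f"{n_imputed} of {len(rows)} responses imputed ({share_imputed:.0%})")
print("imputed subjects:", imputed_subjects)
```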
Non-response weight: There are both unit and item non-response weights. The former down-weights an entire case because the respondent did not respond to any of the items on the survey (perhaps he wasn’t home that day). The latter down-weights “responses” from respondents who did not answer that item.
Certainty PSU: This is a PSU that was guaranteed to be in the sample. This is independent of the sampling design: any sampling design can have one or more certainty PSUs. Certainty PSUs are also called self-representing units.
Poststratification: This is stratification that happens after the sample has been collected, either because the information needed to do stratification was not available when the sample was collected, or because it was not known at the time of data collection that stratification on this variable would be necessary/desirable. The purpose of poststratification is to improve the precision of the estimates or to reduce bias caused by non-response.
Clearly, not all surveys will have all of the features listed above. We will concentrate only on the first four features because they are the most common.
Sampling with and without replacement
Most samples collected in the real world are collected “without replacement”. This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample. Many of the calculations change depending on whether a sample is collected with or without replacement. Hence, programs like SUDAAN request that you specify whether a survey sampling design was implemented with or without replacement, and an FPC is used if sampling without replacement is specified, even if the value of the FPC is very close to one.
Replicate weights
Replicate weights are a feature of an increasing number of public use survey data sets. Replicate weights are a series of weight variables that are used instead of PSUs and strata in an effort to protect the respondents’ identity. Either replicate weights or a Taylor series linearization, which is based on PSUs and/or strata, is necessary for variance estimation.
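To make the idea concrete, here is a minimal sketch of one common replication scheme, the delete-one jackknife (JK1). The seminar does not say which scheme any particular data set uses, and the six responses and their weights below are hypothetical, chosen only for illustration.

```python
# Hypothetical data: 6 responses with equal full-sample weights (not from the seminar).
y = [1, 0, 1, 1, 0, 1]
n = len(y)
full_w = [10.0] * n

def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# One replicate weight variable per deleted unit: zero out unit r and
# scale the remaining weights up by n / (n - 1).
replicate_weights = [
    [0.0 if i == r else w * n / (n - 1) for i, w in enumerate(full_w)]
    for r in range(n)
]

theta = weighted_mean(y, full_w)                         # full-sample estimate
theta_r = [weighted_mean(y, w) for w in replicate_weights]  # one estimate per replicate

# JK1 variance: (R - 1) / R times the sum of squared deviations of the
# replicate estimates from the full-sample estimate.
R = len(replicate_weights)
var_jk = (R - 1) / R * sum((t - theta) ** 2 for t in theta_r)
se_jk = var_jk ** 0.5

print(f"mean = {theta:.4f}, jackknife SE = {se_jk:.4f}")
```

The analyst only needs the full-sample weight and the replicate weight columns; the PSU and stratum identifiers never have to be released, which is the confidentiality benefit described above.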
Summary of four survey data analysis packages
We are now going to summarize some of the features of four survey data analysis packages: Stata, SUDAAN, WesVar and SAS. One feature that all four programs share is that once you specify the sampling design, it is either 1) applied to all analyses until you change it or exit the program (Stata and WesVar) or 2) very easy to apply to all analyses (SUDAAN and SAS). In other words, you only need to go through the work of specifying the design once, and then it applies to all analyses of that data.
Stata:
- handles most sampling designs, except two-stage cluster sampling, probability-proportional-to-size sampling, poststratification and certainty PSUs
- has the most statistical procedures of any of the packages
- does not handle replicate weights
- has a relatively easy to use command interface (point and click in Stata version 8)
SUDAAN:
- handles all sampling designs
- has a fair number of statistical procedures
- handles replicate weights (except for survival analysis)
- has a relatively more difficult to use command interface
WesVar:
- handles all sampling designs except two-stage cluster sampling
- has a fair number of statistical procedures
- handles replicate weights (and can create them from PSUs and strata)
- has a relatively easy to use point-and-click interface
SAS:
- handles all sampling designs except poststratification and two-stage cluster sampling
- has a VERY limited number of statistical features (only means and regression in version 8; frequencies and maybe logistic regression in version 9)
- does not handle replicate weights
- has a relatively more difficult to use command interface
Examples
Now we are going to try some analyses using the different packages. All of the examples shown below were presented in Levy and Lemeshow’s Sampling of Populations. These and other examples from that text and other texts can be found on our website. We will focus on Stata and SUDAAN. If there is time, we will show an example in WesVar. The code to do these examples in SAS is given at the end of this handout, along with some explanation.
Simple random sample in Stata
The entire Stata version 7 program can be downloaded here. The entire Stata version 8 program can be downloaded here. We will use the Stata version 7 code for this seminar.
Although simple random sampling (SRS) is almost never used, we start with this example because it is the least complex and it will serve as a comparison for later examples. Note that in SRS, each observation is a PSU. The Stata code that would be used with an SRS design is given below.
use https://stats.idre.ucla.edu/stat/books/sop/momsag.dta, clear
list birth weight1 momsag in 1/10

         birth    weight1     momsag
  1.       773      30.92          0
  2.       773      30.92          1
  3.       773      30.92          1
  4.       773      30.92          1
  5.       773      30.92          1
  6.       773      30.92          1
  7.       773      30.92          1
  8.       773      30.92          1
  9.       773      30.92          1
 10.       773      30.92          1

svyset pweight weight1
svyset fpc birth
svymean momsag

Survey mean estimation

pweight:  weight1                                 Number of obs    =        25
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =        25
FPC:      birth                                   Population size  =       773

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
  momsag |        .92    .0544746    .8075699     1.03243           1
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
Now let’s see what happens when you ignore the sampling design. We will clear the survey settings from Stata and use the ci command to get the mean, standard error and confidence interval. Hence, in this analysis, the pweight and the fpc are ignored. We will use the svyset command with no options to check that no variables have been set.
* PLEASE REMEMBER THAT THE ANALYSIS BELOW IS INCORRECT!!
svyset, clear
svyset
no variables have been set
ci momsag

    Variable |     Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+-------------------------------------------------------------
      momsag |      25         .92    .0553775        .8057065    1.034294
As you can see, the mean is the same as that obtained when including the sampling design information. However, the standard error is larger. If we multiply this standard error by the square root of the finite population correction, (N - n)/N with N = 773 and n = 25, we will obtain the correct standard error.
display sqrt((773-25)/773)*.0553775
.05447464
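The same comparison can be sketched outside Stata. The 25 responses are not listed in full in this handout, so the data below are an assumption: 23 ones and 2 zeros, which is the only 0/1 pattern that gives a mean of .92 with 25 observations.

```python
import math

N, n = 773, 25          # population and sample sizes from the Stata output
y = [1] * 23 + [0] * 2  # assumption: 23 ones and 2 zeros reproduce the mean of .92

mean = sum(y) / n
s2 = sum((v - mean) ** 2 for v in y) / (n - 1)  # sample variance
se_srs = math.sqrt(s2 / n)                      # design ignored, as reported by ci
se_fpc = se_srs * math.sqrt((N - n) / N)        # fpc applied, as reported by svymean

print(f"mean            = {mean:.2f}")
print(f"SE without fpc  = {se_srs:.7f}")
print(f"SE with fpc     = {se_fpc:.7f}")
print(f"estimated total = {N * mean:.2f}, SE of total = {N * se_fpc:.5f}")
```

Because the sampling fraction here is only 25/773, the correction shrinks the standard error by less than 2%.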
svyset pweight weight1
svyset fpc birth
svytotal momsag

Survey total estimation

pweight:  weight1                                 Number of obs    =        25
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =        25
FPC:      birth                                   Population size  =       773

------------------------------------------------------------------------------
   Total |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
  momsag |     711.16    42.10889    624.2515    798.0685           1
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
Stratified random sampling in Stata
The difference between the example above and the example below is that stratification has been added.
use https://stats.idre.ucla.edu/stat/books/sop/hospsamp.dta, clear
list in 1/10

        hospno    oblevel     weighta    tothosp     births
  1.        15          1        10.5         42        480
  2.        80          1        10.5         42        426
  3.        86          1        10.5         42        342
  4.       136          1        10.5         42        174
  5.         7          2   19.799988         99       2022
  6.        26          2   19.799988         99        576
  7.        62          2   19.799988         99       1999
  8.        90          2   19.799988         99        482
  9.       101          2   19.799988         99        836
 10.        28          3   2.8333321         17       3108

svyset pweight weighta
svyset strata oblevel
svyset fpc tothosp
svytotal births

Survey total estimation

pweight:  weighta                                 Number of obs    =        15
Strata:   oblevel                                 Number of strata =         3
PSU:      <observations>                          Number of PSUs   =        15
FPC:      tothosp                                 Population size  = 157.99993

------------------------------------------------------------------------------
   Total |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
  births |   183982.9    34014.33      109872    258093.8    .7035474
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.

svytotal births, by (oblevel)

Survey total estimation

pweight:  weighta                                 Number of obs    =        15
Strata:   oblevel                                 Number of strata =         3
PSU:      <observations>                          Number of PSUs   =        15
FPC:      tothosp                                 Population size  = 157.99993

------------------------------------------------------------------------------
Total   Subpop. |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------------+--------------------------------------------------------------
births         |
    oblevel==1 |      14931    2669.857    9113.882    20748.12      .15648
    oblevel==2 |   117116.9    33067.66    45068.68    189165.2    1.089405
    oblevel==3 |   51934.98    7508.399    35575.58    68294.37    .0330073
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
One-stage cluster sampling in Stata
use https://stats.idre.ucla.edu/stat/books/sop/tab9_1a.dta, clear
list devlpmnt HH wt1 M NVSTNRS NGE65 hhneedvn in 1/10

      devlpmnt    HH    wt1    M    NVSTNRS    NGE65    hhneedvn
  1.         2     1    2.5    5          1        2           1
  2.         2     2    2.5    5          0        1           0
  3.         2     3    2.5    5          0        2           0
  4.         2     4    2.5    5          1        1           1
  5.         2     5    2.5    5          0        1           0
  6.         2     6    2.5    5          0        1           0
  7.         2     7    2.5    5          1        2           1
  8.         2     8    2.5    5          1        1           1
  9.         2     9    2.5    5          1        3           1
 10.         2    10    2.5    5          1        1           1

svyset pweight wt1
svyset fpc M
svyset psu devlpmnt
svytotal NVSTNRS NGE65

Survey total estimation

pweight:  wt1                                     Number of obs    =        40
Strata:   <one>                                   Number of strata =         1
PSU:      devlpmnt                                Number of PSUs   =         2
FPC:      M                                       Population size  =       100

------------------------------------------------------------------------------
   Total |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
 NVSTNRS |       57.5    1.936492    32.89454    82.10546    .0707804
   NGE65 |      167.5    1.936492    142.8945    192.1055    .0393542
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
svymean NVSTNRS hhneedvn

Survey mean estimation

pweight:  wt1                                     Number of obs    =        40
Strata:   <one>                                   Number of strata =         1
PSU:      devlpmnt                                Number of PSUs   =         2
FPC:      M                                       Population size  =       100

------------------------------------------------------------------------------
    Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
 NVSTNRS |       .575    .0193649    .3289454    .8210546    .0707804
hhneedvn |       .525    .0193649    .2789454    .7710546    .0977444
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.

svyratio NVSTNRS NGE65

Survey ratio estimation

pweight:  wt1                                     Number of obs    =        40
Strata:   <one>                                   Number of strata =         1
PSU:      devlpmnt                                Number of PSUs   =         2
FPC:      M                                       Population size  =       100

------------------------------------------------------------------------------
     Ratio        |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
------------------+-----------------------------------------------------------
    NVSTNRS/NGE65 |   .3432836    .0075924    .2468131    .4397541    .0325067
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
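The confidence intervals in the cluster example are very wide relative to the standard errors. A short sketch shows why: with 2 PSUs and 1 stratum there is only 2 - 1 = 1 design degree of freedom, so the t critical value is about 12.7 rather than 1.96. The numbers are copied from the svytotal and svyratio output above; the closed-form quantile used here holds only for 1 degree of freedom.

```python
import math

n_psu, n_strata = 2, 1
df = n_psu - n_strata  # design degrees of freedom = 1

# With 1 df the t distribution is the Cauchy distribution, whose quantile
# function has the closed form tan(pi * (p - 0.5)).
t_crit = math.tan(math.pi * (0.975 - 0.5))

total, se = 57.5, 1.936492  # NVSTNRS total and standard error from svytotal
ci_low = total - t_crit * se
ci_high = total + t_crit * se

# The ratio estimate is simply the ratio of the two estimated totals.
ratio = 57.5 / 167.5

print(f"df = {df}, t critical value = {t_crit:.4f}")
print(f"95% CI for NVSTNRS total: {ci_low:.5f} to {ci_high:.5f}")
print(f"NVSTNRS/NGE65 = {ratio:.7f}")
```

The standard error itself is small; the interval is wide because so few PSUs were sampled.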
Simple random sampling using SUDAAN
The SAS data files that are used for the SUDAAN and SAS examples can be downloaded here:
https://stats.idre.ucla.edu/wp-content/uploads/2016/02/momsag-1.sas7bdat
https://stats.idre.ucla.edu/wp-content/uploads/2016/02/hospsamp-1.sas7bdat
https://stats.idre.ucla.edu/wp-content/uploads/2016/02/tab9_1c-1.sas7bdat
The entire SUDAAN program can be downloaded here.
proc descript data = momsag filetype = sas design = wor total;
weight weight1;
nest _one_;
totcnt birth;
var momsag;
run;

Number of observations read    :     25    Weighted count :      773
Denominator degrees of freedom :     24

Variance Estimation Method: Taylor Series (WOR)
by: Variable, One.

-----------------------------------------------------
|                 |                 |               |
| Variable        |                 | One           |
|                 |                 | 1             |
-----------------------------------------------------
|                 |                 |               |
| MOMSAG          | Sample Size     |            25 |
|                 | Weighted Size   |        773.00 |
|                 | Total           |        711.16 |
|                 | SE Total        |         42.11 |
|                 | Mean            |          0.92 |
|                 | SE Mean         |          0.05 |
-----------------------------------------------------
Stratified random sampling using SUDAAN
proc descript data = hospsamp filetype = sas design = wor totals;
nest oblevel;
weight weighta;
totcnt tothosp;
var births;
subgroup oblevel;
levels 3;
setenv decwidth = 3;
run;

Number of observations read    :     15    Weighted count :      158
Denominator degrees of freedom :     12

Variance Estimation Method: Taylor Series (WOR)
by: Variable, OBLEVEL.

-----------------------------------------------------------------------
|                 |                 | OBLEVEL                         |
| Variable        |                 | Total          | 1              |
-----------------------------------------------------------------------
|                 |                 |                |                |
| BIRTHS          | Sample Size     |         15.000 |          4.000 |
|                 | Weighted Size   |        158.000 |         42.000 |
|                 | Total           |     183982.904 |      14931.000 |
|                 | SE Total        |      34014.329 |       2669.857 |
|                 | Mean            |       1164.449 |        355.500 |
|                 | SE Mean         |        215.281 |         63.568 |
-----------------------------------------------------------------------

Variance Estimation Method: Taylor Series (WOR)
by: Variable, OBLEVEL.

-----------------------------------------------------------------------
|                 |                 | OBLEVEL                         |
| Variable        |                 | 2              | 3              |
-----------------------------------------------------------------------
|                 |                 |                |                |
| BIRTHS          | Sample Size     |          5.000 |          6.000 |
|                 | Weighted Size   |         99.000 |         17.000 |
|                 | Total           |     117116.928 |      51934.977 |
|                 | SE Total        |      33067.664 |       7508.399 |
|                 | Mean            |       1183.000 |       3055.000 |
|                 | SE Mean         |        334.017 |        441.671 |
-----------------------------------------------------------------------
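Because strata are sampled independently of one another, the stratum totals add up to the overall total and the stratum variances add up to the overall variance. The sketch below checks this against the SUDAAN output above; the dictionary names are ours and the figures are copied from the tables.

```python
import math

# Stratum totals and standard errors copied from the SUDAAN output above,
# keyed by oblevel.
totals = {1: 14931.000, 2: 117116.928, 3: 51934.977}
se_totals = {1: 2669.857, 2: 33067.664, 3: 7508.399}

# Totals add across strata.
overall_total = sum(totals.values())

# Variances (squared standard errors) add across independent strata.
overall_se = math.sqrt(sum(se ** 2 for se in se_totals.values()))

print(f"overall total    = {overall_total:.3f}")
print(f"overall SE total = {overall_se:.3f}")
```

Both values agree with the "Total" column of the output up to rounding in the printed stratum figures. Stratum 2 contributes most of the variance, so it dominates the overall standard error.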
One-stage cluster sampling using SUDAAN
proc descript data = tab9_1c filetype = sas design = wor means totals;
nest _one_ devlpmnt;
totcnt m _zero_;
weight wt1;
var nge65 nvstnrs hhneedvn;
run;

Number of observations read    :     40    Weighted count :      100
Denominator degrees of freedom :      1

Variance Estimation Method: Taylor Series (WOR)
by: Variable, One.

------------------------------------------------------
|                 |                 |                |
| Variable        |                 | One            |
|                 |                 | 1              |
------------------------------------------------------
|                 |                 |                |
| NGE65           | Sample Size     |       40.00000 |
|                 | Weighted Size   |      100.00000 |
|                 | Total           |      167.50000 |
|                 | SE Total        |        1.93649 |
|                 | Mean            |        1.67500 |
|                 | SE Mean         |        0.01936 |
------------------------------------------------------
|                 |                 |                |
| NVSTNRS         | Sample Size     |       40.00000 |
|                 | Weighted Size   |      100.00000 |
|                 | Total           |       57.50000 |
|                 | SE Total        |        1.93649 |
|                 | Mean            |        0.57500 |
|                 | SE Mean         |        0.01936 |
------------------------------------------------------
|                 |                 |                |
| HHNEEDVN        | Sample Size     |       40.00000 |
|                 | Weighted Size   |      100.00000 |
|                 | Total           |       52.50000 |
|                 | SE Total        |        1.93649 |
|                 | Mean            |        0.52500 |
|                 | SE Mean         |        0.01936 |
------------------------------------------------------

proc ratio data = tab9_1c filetype = sas design = wor;
nest _one_ devlpmnt;
totcnt M _zero_;
weight wt1;
numer nvstnrs;
denom nge65;
setenv decwidth = 5;
run;

Number of observations read    :     40    Weighted count :      100
Denominator degrees of freedom :      1

Variance Estimation Method: Taylor Series (WOR)
by: Variable, One.

------------------------------------------------------
|                 |                 |                |
| Variable        |                 | One            |
|                 |                 | 1              |
------------------------------------------------------
|                 |                 |                |
| NVSTNRS/NGE65   | Sample Size     |       40.00000 |
|                 | Weighted Size   |      100.00000 |
|                 | Weighted X-Sum  |      167.50000 |
|                 | Weighted Y-Sum  |       57.50000 |
|                 | Ratio Est.      |        0.34328 |
|                 | SE Ratio        |        0.00759 |
------------------------------------------------------
Simple random sampling using SAS
The entire SAS program can be downloaded here.
proc surveymeans data = momsag n = 773 mean sum std;
weight weight1;
var momsag;
run;

The SURVEYMEANS Procedure

Data Summary
Number of Observations            25
Sum of Weights            773.000002

Statistics
                             Std Error
Variable          Mean         of Mean             Sum         Std Dev
------------------------------------------------------------------------
MOMSAG        0.920000        0.054475      711.160002       42.108894
------------------------------------------------------------------------
Stratified random sampling using SAS
data second138;
input id _TOTAL_ oblevel;
cards;
1 42 1
2 42 1
3 42 1
4 42 1
5 99 2
6 99 2
7 99 2
8 99 2
9 99 2
10 17 3
11 17 3
12 17 3
13 17 3
14 17 3
15 17 3
;
run;
NOTE: You cannot get the totals for both the whole group and the sub-groups in the same proc surveymeans.
NOTE: The data set second138 is used to tell SAS what the totals are in each stratum. These totals are used to compute the finite population correction (fpc). SAS allows only one number to be supplied on the proc surveymeans statement. Because the totals change from one stratum to the next, we need to supply them to SAS in a data set. You can include these data in the primary data set or in a secondary data set. In this example, we will use a secondary data set. Also note that the secondary data set can be “collapsed”; in other words, it needs just one line (observation) for each stratum. In the secondary data set, the variable that contains the totals must be called _TOTAL_. The variable oblevel is copied from the original data set because SAS requires all of the variables listed on the strata statement to appear in this data set. In our example, there is only one variable listed on the strata statement, but in other cases, there may be two or more variables listed.
proc surveymeans data = hospsamp n = second138 sum;
weight weighta;
strata oblevel;
var births;
run;

The SURVEYMEANS Procedure

Data Summary
Number of Strata                   3
Number of Observations            15
Sum of Weights            157.999931

Statistics
Variable             Sum         Std Dev
----------------------------------------
births            183983           34014
----------------------------------------
proc surveymeans data = hospsamp n = second138 sum;
weight weighta;
strata oblevel;
by oblevel;
var births;
run;

oblevel=1

The SURVEYMEANS Procedure

Data Summary
Number of Strata                   1
Number of Observations             4
Sum of Weights                    42

Statistics
Variable             Sum         Std Dev
----------------------------------------
births             14931     2669.856738
----------------------------------------

oblevel=2

The SURVEYMEANS Procedure

Data Summary
Number of Strata                   1
Number of Observations             5
Sum of Weights             98.999939

Statistics
Variable             Sum         Std Dev
----------------------------------------
births            117117           33068
----------------------------------------

oblevel=3

The SURVEYMEANS Procedure

Data Summary
Number of Strata                   1
Number of Observations             6
Sum of Weights            16.9999924

Statistics
Variable             Sum         Std Dev
----------------------------------------
births             51935     7508.399372
----------------------------------------
One-stage cluster sampling using SAS
proc surveymeans data = tab9_1c n = 5 sum mean;
weight wt1;
cluster devlpmnt;
var nge65 nvstnrs hhneedvn;
run;

The SURVEYMEANS Procedure

Data Summary
Number of Clusters                 2
Number of Observations            40
Sum of Weights                   100

Statistics
                             Std Error
Variable          Mean         of Mean             Sum         Std Dev
------------------------------------------------------------------------
NGE65         1.675000        0.019365      167.500000        1.936492
NVSTNRS       0.575000        0.019365       57.500000        1.936492
HHNEEDVN      0.525000        0.019365       52.500000        1.936492
------------------------------------------------------------------------