Survey Data Analysis with R
Why do we need survey data analysis software?
Regular procedures in statistical software (that is not designed for survey data) analyze data as if the data were collected using simple random sampling. For experimental and quasi-experimental designs, this is exactly what we want. However, very few surveys use a simple random sample to collect data. Not only is it nearly impossible to do so, but it is not as efficient (either financially or statistically) as other sampling methods. When any sampling method other than simple random sampling is used, we usually need to use survey data analysis software to take into account the differences between the design that was used to collect the data and simple random sampling. This is because the sampling design affects both the calculation of the point estimates and the standard errors of those estimates. If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, both the point estimates and their standard errors will likely be calculated incorrectly. The sampling weight affects the calculation of the point estimates, and the stratification and/or clustering affects the calculation of the standard errors. Ignoring the clustering will likely lead to standard errors that are underestimated, possibly leading to results that seem to be statistically significant when, in fact, they are not. The difference in point estimates and standard errors obtained using non-survey software and survey software with the design properly specified will vary from data set to data set, and even between analyses using the same data set. While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be.
Sampling designs
Most people do not conduct their own surveys. Rather, they use survey data that some agency or company collected and made available to the public. The documentation must be read carefully to find out what kind of sampling design was used to collect the data. This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs. Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.
Below are some common features of many sampling designs.
Sampling weights: There are several types of weights that can be associated with a survey. Perhaps the most common is the sampling weight. A sampling weight is a probability weight that has had one or more adjustments made to it. Both a sampling weight and a probability weight are used to weight the sample back to the population from which the sample was drawn. By definition, a probability weight is the inverse of the probability of being included in the sample due to the sampling design (except for a certainty PSU, see below). The probability weight is calculated as N/n, where N = the number of elements in the population and n = the number of elements in the sample. For example, if a population has 10 elements and 3 are sampled at random, then the probability weight would be 10/3 = 3.33. In a two-stage design, the probability weight is calculated as f1 x f2, i.e., the inverse of the sampling fraction for the first stage multiplied by the inverse of the sampling fraction for the second stage. Under many sampling plans, the sum of the probability weights will equal the population total.
While many textbooks end their discussion of probability weights here, this definition does not fully describe the sampling weights that are included with actual survey data sets. Rather, the sampling weight, which is sometimes called a “final weight,” starts with the inverse of the sampling fraction but then incorporates several other adjustments, such as corrections for unit non-response, errors in the sampling frame (sometimes called non-coverage), calibration and trimming. Because these adjustments are built into the sampling weight that is included with the data set, it is often inadvisable to modify the weights, such as by trying to standardize them on a particular variable, e.g., age.
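To make the probability-weight arithmetic described above concrete, here is a minimal sketch in R with made-up numbers (none of these figures come from any actual survey):

# Hypothetical two-stage design: 20 of 400 districts sampled at stage 1,
# then 5 of 50 schools sampled within each selected district at stage 2
f1 <- 400 / 20   # inverse of the first-stage sampling fraction: 20
f2 <- 50 / 5     # inverse of the second-stage sampling fraction: 10
f1 * f2          # probability weight for each sampled school: 200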
Strata: Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race or SES. Each element in the population must belong to one, and only one, stratum. Once the strata have been defined, samples are taken from each stratum as if it were independent of all of the other strata. For example, if a sample is to be stratified on gender, men and women would be sampled independently of one another. This means that the probability weights for men will likely be different from the probability weights for women. In most cases, you need to have two or more PSUs in each stratum. The purpose of stratification is to reduce the standard error of the estimates; stratification works most effectively when the variance of the dependent variable is smaller within the strata than in the sample as a whole.
PSU: This is the primary sampling unit, the first unit that is sampled in the design. For example, school districts from California may be sampled and then schools within districts may be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU. One does not need to use the same sampling method at all levels of sampling. For example, probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts). In the case of a simple random sample, the PSUs and the elementary units are the same. In general, accounting for the clustering in the data (i.e., using the PSUs) will increase the standard errors of the point estimates. Conversely, ignoring the PSUs will tend to yield standard errors that are too small, leading to false positives when doing significance tests.
FPC: This is the finite population correction. This is used when the sampling fraction (the number of elements or respondents sampled relative to the population) becomes large. The FPC is used in the calculation of the standard error of the estimate. If the value of the FPC is close to 1, it will have little impact and can be safely ignored. In some survey data analysis programs, such as SUDAAN, this information will be needed if you specify that the data were collected without replacement (see below for a definition of “without replacement”). The formula for calculating the FPC is ((N-n)/(N-1))^(1/2), where N is the number of elements in the population and n is the number of elements in the sample. To see the impact of the FPC for samples of various proportions, suppose that you had a population of 10,000 elements.
Sample size (n) FPC
1 1.0000
10 .9995
100 .9950
500 .9747
1000 .9487
5000 .7071
9000 .3162
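The table above can be reproduced in a few lines of R using the formula just given (a quick sketch, not part of the workshop script):

# FPC = ((N - n)/(N - 1))^(1/2) for a population of N = 10,000
fpc <- function(n, N = 10000) sqrt((N - n) / (N - 1))
round(fpc(c(1, 10, 100, 500, 1000, 5000, 9000)), 4)
# [1] 1.0000 0.9995 0.9950 0.9747 0.9487 0.7071 0.3162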
Replicate weights: Replicate weights are a series of weight variables that are used to correct the standard errors for the sampling plan. They serve the same function as the PSU and strata variables (which are used in Taylor series linearization) to correct the standard errors of the estimates for the sampling design. Many public use data sets are now being released with replicate weights instead of PSUs and strata in an effort to more securely protect the identity of the respondents. In theory, the same standard errors will be obtained using either the PSU and strata variables or the replicate weights. There are different ways of creating replicate weights; the method used is determined by the sampling plan. The most common are balanced repeated replication (BRR) and jackknife replicate weights. You will need to read the documentation for the survey data set carefully to learn what type of replicate weight is included in the data set; specifying the wrong type of replicate weight will likely lead to incorrect standard errors. For more information on replicate weights, please see Stata Library: Replicate Weights. Several statistical packages, including Stata, SAS, R, Mplus, SUDAAN and WesVar, allow the use of replicate weights. Another good source of information on replicate weights is Applied Survey Data Analysis, Second Edition by Steven G. Heeringa, Brady T. West and Patricia A. Berglund (2017, CRC Press).
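In the survey package, a replicate-weight design is created with the svrepdesign function rather than svydesign. Below is a hedged sketch: the data frame mydata, the final weight finalwt, and the replicate-weight columns repwt1, repwt2, ... are all hypothetical, and the correct type argument (e.g., "JK1", "JKn", "BRR") must be taken from the data set’s documentation.

library(survey)
# Hypothetical replicate-weight design; adjust names and type to your documentation
rep_des <- svrepdesign(weights = ~finalwt,
                       repweights = "repwt[0-9]+",  # regular expression matching the replicate-weight columns
                       type = "JK1",                # replication method stated in the documentation
                       data = mydata)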
Consequences of not using the design elements
Sampling design elements include the sampling weights, post-stratification weights (if provided), PSUs, strata, and replicate weights. Rarely are all of these elements included in a single public-use data set. However, ignoring the design elements that are included can often lead to inaccurate point estimates and/or inaccurate standard errors.
Sampling with and without replacement
Most samples collected in the real world are collected “without replacement”. This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample. Many of the calculations change depending on whether a sample is collected with or without replacement. Hence, programs like SUDAAN request that you specify whether a survey sampling design was implemented with or without replacement, and an FPC is used if sampling without replacement is used, even if the value of the FPC is very close to one.
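In the survey package, sampling without replacement is indicated by supplying the fpc argument to svydesign; when fpc is omitted (as in the NHANES examples later in this workshop), the PSUs are treated as sampled with replacement. A hypothetical sketch (the variable names are illustrative, not from NHANES):

# Without-replacement design: fpcvar gives, for each stratum, the size of the
# population of PSUs from which the sampled PSUs were drawn
des_wor <- svydesign(id = ~psuvar, strata = ~stratavar, weights = ~wtvar,
                     fpc = ~fpcvar, data = mydata)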
Examples
For the examples in this workshop, we will use the data set from NHANES 2011-2012. The data set and documentation can be downloaded from the NHANES web site. The data files can be downloaded as SAS transport (.xpt) files. The R script used for the workshop is here.
Reading the documentation
The first step in analyzing any survey data set is to read the documentation. With many of the public use data sets, the documentation can be quite extensive and sometimes even intimidating. Instead of trying to read the documentation “cover to cover”, there are some parts you will want to focus on. First, read the Introduction. This is usually an “easy read” and will orient you to the survey. There is usually a section or chapter called something like “Sample Design and Analysis Guidelines”, “Variance Estimation”, etc. This is the part that tells you about the design elements included with the survey and how to use them. Some even give example code. If multiple sampling weights have been included in the data set, there will be some instruction about when to use which one. If there is a section or chapter on missing data or imputation, please read that. This will tell you how missing data were handled. You should also read any documentation regarding the specific variables that you intend to use. As we will see a little later on, we will need to look at the documentation to get the value labels for the variables. This is especially important because some of the values are actually missing data codes, and you need to do something so that R doesn’t treat those as valid values (or you will get some very “interesting” means, totals, etc.).
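For example, many NHANES items use special codes such as 7777 for “refused” and 9999 for “don’t know”; the exact codes vary by variable, so check the codebook. A minimal sketch of recoding such codes to NA before creating the design object (the codes shown are illustrative):

# Recode "refused"/"don't know" codes to NA so they are not treated as valid values
# (7777 and 9999 are illustrative; confirm the codes in the codebook for each variable)
nhanes2012$pad680[nhanes2012$pad680 %in% c(7777, 9999)] <- NA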
Downloading and installing the packages
You should use the following install.packages commands only once. The commands should be left in your R script file but commented out (by placing a “#” symbol before each).
install.packages("haven") install.packages("survey") install.packages("jtools") install.packages("remotes") remotes::install_github("carlganz/svrepmisc")
After the packages are downloaded, they need to be loaded. This needs to be done at the beginning of each R session.
library("haven") library("survey") library("jtools") library("remotes") library("svrepmisc")
Reading the data into R
The data are distributed from the NHANES website as SAS transport (.xpt) files, which can be read with the read_xpt function from the “haven” package. For this workshop, the data have been saved as a reduced Stata-format file, which we read with haven’s read_dta function. The command will look like this:
nhanes2012 <- read_dta("D:/data/Seminars/Applied Survey Data Analysis R/nhanes2012_reduced.dta")
The variables
We will use about a dozen different variables in the examples in this workshop. Below is a brief summary of them. Some of the variables have been recoded to be binary variables (values of 2 recoded to a value of 0). The count of missing observations includes values truly missing as well as refused and don’t know.
ridageyr – Age in years at exam – recoded; range of values: 0 – 79 are actual values, 80 = 80+ years of age
pad630 – How much time do you spend doing moderate-intensity activities on a typical work day?; range of values: 10-960; 7053 missing observations
hsq496 – During the past 30 days, for about how many days have you felt worried, tense or anxious?; range of values: 0-30; 3073 missing observations
female – Recode of the variable riagendr; 0 = male, 1 = female; no missing observations
dmdborn4 – Country of birth; 1 = born in the United States, 0 = otherwise; 5 missing observations
dmdmartl – Marital status; 1 = married, 2 = widowed, 3 = divorced, 4 = separated, 5 = never married, 6 = living with partner; 4203 missing observations
dmdeduc2 – Education level of adults aged 20+ years; 1 = less than 9th grade, 2 = 9-11th grade, 3 = high school graduate, GED or equivalent, 4 = some college or AA degree, 5 = college graduate or above; 4201 missing observations
pad675 – How much time do you spend doing moderate-intensity sports, fitness, or recreation activities on a typical day?; range of values: 10-600; 6220 missing observations
hsq571 – During the past 12 months, have you donated blood?; 0 = no, 1 = yes; 3673 missing observations
pad680 – How much time do you usually spend sitting on a typical day?; range of values: 0-1380; 2365 missing observations
paq665 – Do you do any moderate-intensity sports, fitness or recreational activities that cause a small increase in breathing or heart rate at least 10 minutes continuously?; 0 = no, 1 = yes; 2329 missing observations
hsd010 – Would you say that your general health is…; 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor; 3064 missing observations
The svydesign function
Before we can start our analyses, we need to use the svydesign function from the “survey” package written by Thomas Lumley. The svydesign function tells R about the design elements in the survey. Once this command has been issued, all you need to do for your analyses is use the object that contains this information in each command. Because the 2011-2012 NHANES data were released with a sampling weight (wtint2yr), a PSU variable (sdmvpsu) and a strata variable (sdmvstra), we will use these in our svydesign function. The svydesign function looks like this:
# nhanes2012 <- read_dta("D:/data/Seminars/Applied Survey Data Analysis R/nhanes2012_reduced.dta") nhc <- svydesign(id=~sdmvpsu, weights=~wtint2yr,strata=~sdmvstra, nest=TRUE, survey.lonely.psu = "adjust", data=nhanes2012) nhc
Stratified 1 - level Cluster Sampling design (with replacement)
With (31) clusters.
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra,
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2012)
We can get additional information about the sample, such as the number of PSUs per strata, by using the summary function.
summary(nhc)
Stratified 1 - level Cluster Sampling design (with replacement)
With (31) clusters.
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra,
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2012)
Probabilities:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
4.541e-06 2.866e-05 5.526e-05 6.372e-05 8.809e-05 3.011e-04
Stratum Sizes:
            90  91  92  93  94  95  96  97  98  99 100 101 102 103
obs        862 998 875 602 688 722 676 608 708 682 700 715 624 296
design.PSU   3   3   3   2   2   2   2   2   2   2   2   2   2   2
actual.PSU   3   3   3   2   2   2   2   2   2   2   2   2   2   2
Data variables:
 [1] "dmdborn4" "dmdeduc2" "dmdmartl" "dmdyrsus" "female"   "hsd010"   "hsq496"
 [8] "hsq571"   "pad615"   "pad630"   "pad660"   "pad675"   "pad680"   "paq610"
[15] "paq625"   "paq640"   "paq655"   "paq665"   "paq670"   "paq710"   "paq715"
[22] "ridageyr" "sdmvpsu"  "sdmvstra" "seqn"     "wtint2yr"
Descriptive statistics with continuous variables
We will start with something simple: calculating the mean of a continuous variable. In this example, we use the variable ridageyr, which is the age of the respondent. Please note that the documentation for the svymean function, as well as other functions that provide descriptive statistics, is found in the section of the documentation called surveysummary.
svymean(~ridageyr, nhc)
           mean     SE
ridageyr 37.185 0.6965
We can also get the standard deviation of the age variable. We use the function svysd, which is found in the jtools package.
svysd(~ridageyr, design = nhc, na = TRUE)
         std. dev.
ridageyr     22.37
When there are missing data for a variable, the na = TRUE argument is needed. (R’s partial argument matching passes na to these functions’ na.rm argument.)
svymean(~pad630, nhc, na = TRUE)
         mean     SE
pad630 139.89 5.5791
Here is another example.
svymean(~hsq496, nhc, na = TRUE)
         mean   SE
hsq496 5.3839 0.19
The means of more than one variable can be obtained by placing “+” between the variables. Notice that listwise deletion has been done, so the means in the output below are different from the means shown above.
svymean(~ridageyr+pad630+hsq496, nhc, na = TRUE)
             mean     SE
ridageyr  41.9779 0.8212
pad630   139.7228 5.7137
hsq496     5.3739 0.2781
In the examples below, the mean, standard deviation, and variance are obtained.
svymean(~pad680, nhc, na = TRUE)
        mean    SE
pad680 391.3 5.986

svysd(~pad680, design = nhc, na = TRUE)
       std. dev.
pad680   200.846

svyvar(~pad680, design = nhc, na = TRUE)
       variance     SE
pad680    40339 827.77
The cv function is used to get the coefficient of variation, which is the ratio of the standard error to the estimate (often multiplied by 100 and reported as a percentage; cv returns the raw ratio). It is an indication of the variability relative to the mean in the population and is not affected by the unit of measurement of the variable.
cv(svymean(~pad680, design = nhc, na = TRUE))
           pad680
pad680 0.01529783
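As a quick check, the same ratio can be computed by hand from the object returned by svymean, using the survey package’s coef and SE extractor functions (a small sketch, not part of the original workshop code):

m <- svymean(~pad680, design = nhc, na = TRUE)
as.numeric(SE(m) / coef(m))   # 5.986 / 391.3, matching the cv() result above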
Below is an example of how to obtain the design effect (Deff). The Deff is a ratio of two variances: in the numerator is the variance estimate from the current sample (including all of its design elements), and in the denominator is the variance from a hypothetical simple random sample (SRS) of the same size. In other words, the Deff tells you how efficient your sample is compared to an SRS of equal size. If the Deff is less than 1, your sample is more efficient than SRS; usually the Deff is greater than 1. In the example below, the Deff is 5.9887. This means that a sample drawn using the current sampling plan would need to be roughly six times as large as an SRS to yield estimates of the same precision. The Deff is specific to a variable, so some variables may have larger or smaller Deffs.
svymean(~pad680, nhc, na = TRUE, deff = TRUE)
          mean      SE   DEff
pad680 391.300   5.986 5.9887
Quantiles are a useful descriptive statistic for continuous variables, particularly variables that are not normally distributed.
svyquantile(~hsq496, design = nhc, na = TRUE, c(.25,.5,.75), ci=TRUE)
$quantiles
       0.25 0.5 0.75
hsq496    0   1    5

$CIs
, , hsq496

       0.25 0.5 0.75
(lower    0   1    5
upper)    0   2    7
Let’s get some descriptive statistics for binary variables. We have some choices here. Let’s start by getting the mean of the variable female, which is coded 1 for females and 0 for males. Taking the mean of a variable that is coded 0/1 gives the proportion of 1s, so the mean of this variable is the estimated proportion of the population that is female.
# means and proportions for binary variables
svymean(~female, nhc)
          mean     SE
female 0.51195 0.0064
We can use the confint function to get the confidence interval around this mean.
confint(svymean(~female, nhc))
           2.5 %    97.5 %
female 0.4993307 0.5245742
However, according to a comment on page 70 of the documentation for the survey package, we should use svyciprop rather than confint. There are several options that can be supplied for the method argument. Please see pages 70-71 of the documentation. The likelihood option uses the (Rao-Scott) scaled chi-squared distribution for the log likelihood from a binomial distribution.
svyciprop(~I(female==1), nhc, method="likelihood") 2.5% 97.5% I(female == 1) 0.512 0.498 0.53
# li is short for likelihood svyciprop(~I(female==0), nhc, method="li") 2.5% 97.5% I(female == 0) 0.488 0.474 0.5
The logit option fits a logistic regression model and computes a Wald-type interval on the log-odds scale, which is then transformed to the probability scale.
svyciprop(~I(female==1), nhc, method="logit") 2.5% 97.5% I(female == 1) 0.512 0.498 0.53
The xlogit option uses a logit transformation of the mean and then back-transforms to the probability scale. This appears to be the method used by SUDAAN and SPSS COMPLEX SAMPLES.
svyciprop(~I(female==1), nhc, method="xlogit") 2.5% 97.5% I(female == 1) 0.512 0.498 0.53
As stated on page 71 of the documentation, the use of the mean option (shortened in the code below to me), reproduces the results given by Stata’s svy: mean command.
svyciprop(~I(female==1), nhc, method="me", df=degf(nhc)) 2.5% 97.5% I(female == 1) 0.512 0.498 0.53
As stated on page 71 of the documentation, the use of the logit option (shortened in the code below to lo) together with df = degf(nhc) reproduces the results given by Stata’s svy: prop command.
svyciprop(~I(female==1), nhc, method="lo", df=degf(nhc)) 2.5% 97.5% I(female == 1) 0.512 0.498 0.53
You can also get the proportions of 0s, as shown below.
svyciprop(~I(female==0), nhc, method="mean") 2.5% 97.5% I(female == 0) 0.488 0.474 0.5
Below is another way to get the proportions of 1s. The point of these examples is that you can use this type of syntax to get the proportion of any level of a categorical variable.
svyciprop(~I(female==1), nhc, method="mean") 2.5% 97.5% I(female == 1) 0.512 0.498 0.53
Finally, let’s see how to get totals. We will use the svytotal function.
svytotal(~dmdborn4, design = nhc, na = TRUE)
             total       SE
dmdborn4 260493506 19670647
In the next example, we will get the coefficient of variation for this total.
cv(svytotal(~dmdborn4, design = nhc, na = TRUE))
         dmdborn4
dmdborn4 0.075513
Let’s get the design effect for the total.
svytotal(~dmdborn4, design = nhc, na = TRUE, deff = TRUE)
             total       SE  DEff
dmdborn4 260493506 19670647 314.9
Descriptive statistics for categorical variables
Let’s see some ways to get descriptive statistics for categorical variables, whether or not these variables are binary. We will start with the svytable function. In the example below, we get the weighted frequencies for the variable female.
svytable(~female, design = nhc)
female
        0         1
149630839 156959842
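Because svytable returns an ordinary R table of weighted counts, it can be wrapped in base R’s prop.table to turn the counts into proportions (a small sketch, not part of the original workshop code):

# weighted proportions rather than weighted counts
prop.table(svytable(~female, design = nhc))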
We can also use svytable to get crosstabulations. We will start with two-way crosstabs.
# 2-way
svytable(~female+dmdborn4, nhc)
      dmdborn4
female         0         1
     0  22449131 127102676
     1  23543299 133390830
In the next example, we use a different syntax to do the same thing. Notice that the output is displayed differently, although the information in the output is the same.
# 2-way
svytable(~interaction(female, dmdborn4), design = nhc)
interaction(female, dmdborn4)
      0.0       1.0       0.1       1.1
 22449131  23543299 127102676 133390830
Now let’s get a three-way table.
# 3-way
svytable(~interaction(female, dmdborn4, hsq571), design = nhc)
interaction(female, dmdborn4, hsq571)
     0.0.0      1.0.0      0.1.0      1.1.0      0.0.1      1.0.1      0.1.1      1.1.1
16994772.6 16101456.8 78140097.2 83984812.9   503957.9   277883.3  7083759.3  6155573.1
Let’s go a little crazy and get a four-way table.
# 4-way
svytable(~interaction(female, dmdborn4, hsq571, paq665), design = nhc)
interaction(female, dmdborn4, hsq571, paq665)
    0.0.0.0     1.0.0.0     0.1.0.0     1.1.0.0     0.0.1.0     1.0.1.0     0.1.1.0
 9742281.10  9259759.05 41213690.37 43983640.78   314995.28    95874.61  2938039.53
    1.1.1.0     0.0.0.1     1.0.0.1     0.1.0.1     1.1.0.1     0.0.1.1     1.0.1.1
 2587794.24  7252491.54  6841697.75 36908800.64 40001172.12   188962.61   182008.74
    0.1.1.1     1.1.1.1
 4145719.79  3567778.91
Although not a descriptive statistic, let’s see how to get a chi-squared test while we are talking about tables. Of course, only a two-way table can be specified.
svychisq(~female+dmdborn4, nhc, statistic="adjWald")

	Design-based Wald test of association

data:  svychisq(~female + dmdborn4, nhc, statistic = "adjWald")
F = 0.00019185, ndf = 1, ddf = 17, p-value = 0.9891
Graphing of continuous variables
Let’s start with a histogram. By default, the density is shown on the y-axis.
svyhist(~pad630, nhc)
Instead of the density on the y-axis, we can request the count. Notice the large count of respondents in the last column on the right. The coding of the ridageyr variable explains this.
svyhist(~ridageyr, nhc, probability = FALSE)
We can also create boxplots.
svyboxplot(~hsq496~1, nhc, all.outliers=TRUE)
We can break the boxplot by a grouping variable. The grouping variable must be a factor.
svyboxplot(~hsq496~factor(female), nhc, all.outliers=TRUE)
We can make barcharts. In the example below, we also preview some syntax used for subpopulation analysis.
barplt<-svyby(~pad675+pad630, ~female, nhc, na = TRUE, svymean) barplot(barplt,beside=TRUE,legend=TRUE)
dotchart(barplt)
We can make a scatterplot with the sampling weights corresponding to the bubble size.
svyplot(~pad675+pad630, nhc, style="bubble")
There are a variety of density and smoothed plots that can be made. Some examples are below.
smth<-svysmooth(~pad630, design=nhc) plot(smth)
dens<-svysmooth(~pad630, design=nhc,bandwidth=30) plot(dens)
dens1<-svysmooth(~pad630, design=nhc) plot(dens1)
Subpopulation Analysis
Before we continue with our descriptive statistics, we should pause to discuss the analysis of subpopulations. The analysis of subpopulations is one place where survey data and experimental data are quite different. If you have data from an experiment (or quasi-experiment), and you want to analyze the responses from, say, just the women, or just people over age 50, you can just delete the unwanted cases from the data set or use by-group processing. Complex survey data are different. With survey data, you (almost) never get to delete any cases from the data set, even if you will never use them in any of your analyses. Instead, the survey package has two options that allow you to correctly analyze subpopulations of your survey data. These options are svyby and subset.survey.design. The subset.survey.design option is sort of like deleting unwanted cases (without really deleting them, of course), and the svyby option is very similar to by-group processing in that the results are shown for each group of the by-variable.
First, however, let’s take a second to see why deleting cases from a survey data set can be so problematic. There are two formulas that can be used to calculate the standard errors. One formula is used when you do by-group processing or delete unwanted cases from the data set; survey statisticians call this the conditional approach. This is used when members of the subpopulation cannot appear in certain strata and therefore those strata should not be used in the calculation of the standard error. In practice, this rarely happens in public-use complex survey data sets. One reason is that the analyst usually does not know which combination of variables defines a particular stratum.
The other formula is used when you use the svyby option; survey statisticians call this the unconditional approach. This is used when members of the subpopulation can be in any of the strata, even if there are some strata in the sample data that do not contain any members of the subpopulation. Because members of the subpopulation can appear in any stratum, all of the strata need to be used in the calculation of the standard error, and hence all of the data must be in the data set. If the data set is subset (meaning that observations not in the subpopulation are deleted from the data set), the standard errors of the estimates cannot be calculated correctly. When the svyby option is used, only the cases defined by the subpopulation are used in the calculation of the estimate, but all cases are used in the calculation of the standard errors. For more information on this issue, please see Sampling Techniques, Third Edition by William G. Cochran (1977) and Small Area Estimation by J. N. K. Rao (2003). A nice description of this issue is given in Brady West’s 2009 Stata Conference presentation (in Washington, D.C.).
Both svyby and subset.survey.design use the formula for the unconditional standard errors.
Let’s start by calculating the mean of age.
svymean(~ridageyr, nhc)
           mean     SE
ridageyr 37.185 0.6965
Now let’s calculate the mean of age for males and females. In this example, the variable female is the subpopulation variable.
svyby(~ridageyr, ~female, nhc, svymean)
  female ridageyr        se
0      0 36.22918 0.8431945
1      1 38.09657 0.6713502
You can use more than one categorical variable to define the subpopulation. To do so, put + between the variables.
svyby(~ridageyr, ~dmdmartl+female, nhc, svymean)
    dmdmartl female ridageyr        se
1.0        1      0 51.34608 0.7473752
2.0        2      0 67.49342 2.6062013
3.0        3      0 52.42614 0.7270806
4.0        4      0 47.16402 1.2085674
5.0        5      0 33.78962 1.3163405
6.0        6      0 40.55277 1.8945413
1.1        1      1 49.33443 0.6176356
2.1        2      1 71.39812 0.8627829
3.1        3      1 53.22067 0.7138037
4.1        4      1 47.16444 1.2200152
5.1        5      1 33.64747 1.4922117
6.1        6      1 35.94412 1.4423662
In the next example, three variables are used.
svyby(~pad630, ~dmdmartl+dmdeduc2+female, nhc, na = TRUE, svymean)
      dmdmartl dmdeduc2 female    pad630        se
1.1.0        1        1      0 175.26250  28.12358
2.1.0        2        1      0 187.34360  62.96688
3.1.0        3        1      0   0.00000   0.00000
4.1.0        4        1      0 205.13325  35.53482
5.1.0        5        1      0 221.33212  60.76421
6.1.0        6        1      0 188.91514  71.24062
1.2.0        1        2      0 191.89233  18.18910
2.2.0        2        2      0 167.20401  14.10054
3.2.0        3        2      0 238.10555  29.71073
4.2.0        4        2      0 199.08510  38.03347
5.2.0        5        2      0 260.45360  41.06904
6.2.0        6        2      0 257.39812  43.61217
1.3.0        1        3      0 165.88490   9.75834
2.3.0        2        3      0 103.25693  20.07422
3.3.0        3        3      0 159.62009  17.52999
4.3.0        4        3      0 279.83812  48.20102
5.3.0        5        3      0 169.78003  18.53166
6.3.0        6        3      0 215.15772  24.03397
1.4.0        1        4      0 155.18399  13.21638
2.4.0        2        4      0 125.93947  35.75450
3.4.0        3        4      0 271.14720  56.91977
4.4.0        4        4      0  81.12185  29.50234
5.4.0        5        4      0 125.94122  11.38165
6.4.0        6        4      0 198.18129  25.91694
1.5.0        1        5      0 133.26699  10.84879
2.5.0        2        5      0 123.88774  52.93405
3.5.0        3        5      0 181.38341  33.00260
4.5.0        4        5      0 237.32606  84.81068
5.5.0        5        5      0 104.57660  26.15903
6.5.0        6        5      0 105.54748   7.58048
1.1.1        1        1      1 123.18726  33.99413
2.1.1        2        1      1 176.09593  72.50964
3.1.1        3        1      1  54.28887  15.49709
4.1.1        4        1      1 169.35988  80.57609
5.1.1        5        1      1 309.99124  93.15230
6.1.1        6        1      1 120.00000   0.00000
1.2.1        1        2      1 143.36162  14.00536
2.2.1        2        2      1  90.60535  17.21293
3.2.1        3        2      1 163.20222  47.28731
4.2.1        4        2      1 165.84824  46.57393
5.2.1        5        2      1 166.33346  46.45898
6.2.1        6        2      1 144.99504  24.39718
1.3.1        1        3      1 111.80803  12.92772
2.3.1        2        3      1 110.48051  37.21189
3.3.1        3        3      1 145.50738  25.43471
4.3.1        4        3      1 302.02249 117.29823
5.3.1        5        3      1 190.28528  55.64751
6.3.1        6        3      1  98.81543  21.16077
1.4.1        1        4      1 146.28879  22.89246
2.4.1        2        4      1 101.83852  11.74256
3.4.1        3        4      1 129.58833  28.49053
4.4.1        4        4      1 142.53219  26.01105
5.4.1        5        4      1 109.54923  15.62974
6.4.1        6        4      1 152.59225  23.74680
1.5.1        1        5      1  88.91933   8.30694
2.5.1        2        5      1 260.99621  63.18704
3.5.1        3        5      1  76.00663  11.32266
4.5.1        4        5      1 117.92163  34.79595
5.5.1        5        5      1 126.51918  19.29269
6.5.1        6        5      1  89.01130  23.17592
Sometimes you don’t want so much output. Rather, you just want the output for a specific group. You can get this by creating a subpopulation of the data with the subset function. In the example below, we obtain the output only for males.
smale <- subset(nhc,female == 0) summary(smale) Stratified 1 - level Cluster Sampling design (with replacement) With (31) clusters. subset(nhc, female == 0) Probabilities: Min. 1st Qu. Median Mean 3rd Qu. Max. 5.095e-06 3.044e-05 5.680e-05 6.521e-05 9.250e-05 2.326e-04 Stratum Sizes: 90 91 92 93 94 95 96 97 98 99 100 101 102 103 obs 426 496 452 288 351 355 353 289 352 338 355 357 315 129 design.PSU 3 3 3 2 2 2 2 2 2 2 2 2 2 2 actual.PSU 3 3 3 2 2 2 2 2 2 2 2 2 2 2 Data variables: [1] "dmdborn4" "dmdeduc2" "dmdmartl" "dmdyrsus" "female" "hsd010" "hsq496" [8] "hsq571" "pad615" "pad630" "pad660" "pad675" "pad680" "paq610" [15] "paq625" "paq640" "paq655" "paq665" "paq670" "paq710" "paq715" [22] "ridageyr" "sdmvpsu" "sdmvstra" "seqn" "wtint2yr" svymean(~ridageyr,design=smale) mean SE ridageyr 36.229 0.8432
Models
A wide variety of statistical models can be run with complex survey data. With only a few exceptions, the results of these analyses can be interpreted just as the results from the same analyses with experimental or quasi-experimental data. For example, if you run an OLS regression with weighted data, assuming that the sampling plan has been correctly specified, the regression coefficients are interpreted exactly as any other OLS regression coefficient. The same is true for the various logistic regression models, including binary logistic regression, ordinal logistic regression and multinomial logistic regression (of which there is not an example in this workshop). Most of the assumptions of these models are also the same. However, some assumptions, such as the assumption regarding the normality of the residuals in OLS regression, are often not meaningful because of the large sample size commonly seen with complex survey data.
t-tests
Let’s start doing some simple statistics. One-sample, paired-sample and independent-samples t-tests can be run with the svyttest function.
A one-sample t-test is shown below.
svyttest(pad675~0, nhc, na = TRUE)

	Design-based one-sample t-test

data:  pad675 ~ 0
t = 40.447, df = 16, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 63.84046 70.90254
sample estimates:
   mean
67.3715
In the example of the paired-samples t-test below, the “I” is used to tell R to leave the part in parentheses “as is”, meaning do the subtraction between the two variables. Hence, the formula tests whether the mean of the difference between pad660 and pad675 is equal to 0.
svyttest(I(pad660-pad675)~0, nhc, na = TRUE)

	Design-based one-sample t-test

data:  I(pad660 - pad675) ~ 0
t = 3.3059, df = 16, p-value = 0.004464
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  2.718683 12.437915
sample estimates:
    mean
7.578299
Now let’s run an independent-samples t-test.
svyttest(pad630~female, nhc)

	Design-based t-test

data:  pad630 ~ female
t = -5.6277, df = 16, p-value = 3.779e-05
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 -45.76433 -22.12149
sample estimates:
difference in mean
         -33.94291
As you probably know, an independent-samples t-test tests the null hypothesis that the difference in the means of the two groups is 0. Another way to think about this type of t-test is to think of it as a linear regression with a single binary predictor. The intercept will be the mean of the reference group, and the coefficient will be the difference between the two groups.
We will start by running the t-test function as before, and then replicate the results using the svyglm function, which can be used to run a linear regression. The svyby function is used with the covmat argument to save the elements to a matrix so that we can use the svycontrast function to subtract the values. The purpose of this example is not to belabor the point about a t-test, but rather to show how to get a matrix of values and then compare those values with the svycontrast function in a simple example where the answer is already known.
svyttest(ridageyr~female, nhc)

	Design-based t-test

data:  ridageyr ~ female
t = 2.9691, df = 16, p-value = 0.009043
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 0.634692 3.100076
sample estimates:
difference in mean
          1.867384
summary(svyglm(ridageyr~female, design=nhc))

Call:
svyglm(formula = ridageyr ~ female, design = nhc)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra,
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2012)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  36.2292     0.8432  42.967  < 2e-16 ***
female        1.8674     0.6289   2.969  0.00904 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 499.5325)

Number of Fisher Scoring iterations: 2
a <- svyby(~ridageyr, ~female, nhc, na.rm.by = TRUE, svymean, covmat = TRUE) vcov(a) 0 1 0 0.7109769 0.3830638 1 0.3830638 0.4507111 a female ridageyr se 0 0 36.22918 0.8431945 1 1 38.09657 0.6713502
svycontrast(a, c(-1, 1))
         contrast     SE
contrast   1.8674 0.6289
This example is similar to the previous example, except that here the svypredmeans function is used.
summary(svyglm(pad630~female, design=nhc))

Call:
svyglm(formula = pad630 ~ female, design = nhc)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra,
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2012)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  155.627      7.008  22.206 1.90e-13 ***
female       -33.943      6.031  -5.628 3.78e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 21951.7)

Number of Fisher Scoring iterations: 2
Note that the variable female cannot be in the model if you want to get the predicted means for that variable.
ttest1 <- (svyglm(ridageyr~1, design=nhc)) summary(ttest1) Call: svyglm(formula = ridageyr ~ 1, design = nhc) Survey design: svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2012) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 37.1852 0.6965 53.39 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for gaussian family taken to be 500.4039) Number of Fisher Scoring iterations: 2
The variable female is used here to get the predicted means for each level of female.
svypredmeans(ttest1, ~female)
    mean     SE
0 36.229 0.8432
1 38.097 0.6714
tt<-svyttest(pad630~female, nhc) tt Design-based t-test data: pad630 ~ female t = -5.6277, df = 16, p-value = 3.779e-05 alternative hypothesis: true difference in mean is not equal to 0 95 percent confidence interval: -45.76433 -22.12149 sample estimates: difference in mean -33.94291
We can get the confidence interval around the difference. In this example, we get the 90% confidence interval.
confint(tt, level=0.9)
[1] -43.67864 -24.20718
attr(,"conf.level")
[1] 0.9
Multiple linear regression
We need to use the summary function to get the standard errors, test statistics and p-values. Let’s start with a model that has no interaction terms. The outcome variable will be pad630, and the predictors will be female and hsq571.
summary(svyglm(pad630~female+hsq571, design=nhc, na.action = na.omit))

Call:
svyglm(formula = pad630 ~ female + hsq571, design = nhc, na.action = na.omit)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra,
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2012)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  164.707      6.595  24.975 1.24e-13 ***
female       -39.296      6.325  -6.212 1.66e-05 ***
hsq571       -11.722     20.267  -0.578    0.572
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 23558.77)

Number of Fisher Scoring iterations: 2
Now let’s add an interaction between the two predictor variables.
summary(svyglm(pad630~female*hsq571, design=nhc, na.action = na.omit))

Call:
svyglm(formula = pad630 ~ female * hsq571, design = nhc, na.action = na.omit)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra,
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2012)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    167.107      6.847  24.404 7.13e-13 ***
female         -44.520      6.827  -6.521 1.35e-05 ***
hsq571         -40.975     18.282  -2.241  0.04174 *
female:hsq571   67.630     19.484   3.471  0.00374 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 23441.51)

Number of Fisher Scoring iterations: 2
glm1 <- (svyglm(pad630~female*hsq571, design=nhc, na.action = na.omit)) glm1 Stratified 1 - level Cluster Sampling design (with replacement) With (31) clusters. svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2012) Call: svyglm(formula = pad630 ~ female * hsq571, design = nhc, na.action = na.omit) Coefficients: (Intercept) female hsq571 female:hsq571 167.11 -44.52 -40.97 67.63 Degrees of Freedom: 1672 Total (i.e. Null); 14 Residual (8083 observations deleted due to missingness) Null Deviance: 40340000 Residual Deviance: 39190000 AIC: 21580 confint(glm1) 2.5 % 97.5 % (Intercept) 153.68606 180.527513 female -57.89966 -31.139511 hsq571 -76.80715 -5.141942 female:hsq571 29.44252 105.817225
This example is just like the previous one, only here factor notation is used. This is important when the categorical predictor has more than two levels.
summary(svyglm(pad630~factor(female)*factor(dmdmartl), design=nhc, na.action = na.omit))

Call:
svyglm(formula = pad630 ~ factor(female) * factor(dmdmartl),
    design = nhc, na.action = na.omit)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra,
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2012)

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)
(Intercept)                        156.694      8.766  17.876 1.97e-06 ***
factor(female)1                    -38.069     12.438  -3.061   0.0222 *
factor(dmdmartl)2                  -24.083     23.185  -1.039   0.3390
factor(dmdmartl)3                   52.729     28.804   1.831   0.1169
factor(dmdmartl)4                   31.274     32.355   0.967   0.3711
factor(dmdmartl)5                  -10.271     13.330  -0.771   0.4702
factor(dmdmartl)6                   44.728     16.959   2.637   0.0387 *
factor(female)1:factor(dmdmartl)2   26.414     28.913   0.914   0.3962
factor(female)1:factor(dmdmartl)3  -46.720     43.539  -1.073   0.3245
factor(female)1:factor(dmdmartl)4   51.033     51.334   0.994   0.3585
factor(female)1:factor(dmdmartl)5   23.355     18.086   1.291   0.2441
factor(female)1:factor(dmdmartl)6  -40.780     25.360  -1.608   0.1589
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 23215.65)

Number of Fisher Scoring iterations: 2

ols1 <- (svyglm(pad630~1, design=nhc, na.action = na.omit))
predmarg <- svypredmeans(ols1, ~interaction(female,dmdmartl))
predmarg
      mean      SE
1.1 118.63  9.5904
0.6 201.42 13.4189
0.1 156.69  8.7656
0.5 146.42 12.3533
1.5 131.71 13.3056
1.4 200.93 52.4999
1.6 122.57 14.8851
0.3 209.42 27.2862
1.2 120.96 11.1044
1.3 124.63 15.5059
0.4 187.97 38.8611
0.2 132.61 20.7758
Non-parametric tests
Non-parametric tests can also be done. Let’s start with a Wilcoxon rank-sum test, which is the non-parametric analog of an independent-samples t-test.
wil <- svyranktest(hsq496~female, design = nhc, na = TRUE, test = c("wilcoxon")) wil Design-based KruskalWallis test data: hsq496 ~ female t = 6.3291, df = 16, p-value = 1.002e-05 alternative hypothesis: true difference in mean rank score is not equal to 0 sample estimates: difference in mean rank score 0.06896535
This is an example of a median test.
mtest <- svyranktest(hsq496~female, design = nhc, na = TRUE, test=("median")) mtest Design-based median test data: hsq496 ~ female t = 4.9504, df = 16, p-value = 0.0001446 alternative hypothesis: true difference in mean rank score is not equal to 0 sample estimates: difference in mean rank score 0.11726
This is an example of a Kruskal-Wallis test, which is the non-parametric analog of a one-way ANOVA.
kwtest <- svyranktest(hsq496~female, design = nhc, na = TRUE, test=("KruskalWallis")) kwtest Design-based KruskalWallis test data: hsq496 ~ female t = 6.3291, df = 16, p-value = 1.002e-05 alternative hypothesis: true difference in mean rank score is not equal to 0 sample estimates: difference in mean rank score 0.06896535
Logistic regression
Let’s see a few examples of logistic regression.
logit1 <- (svyglm(paq665~factor(hsd010)+ridageyr, family=quasibinomial, design=nhc, na.action = na.omit)) summary(logit1) Call: svyglm(formula = paq665 ~ factor(hsd010) + ridageyr, design = nhc, family = quasibinomial, na.action = na.omit) Survey design: svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2012) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.717830 0.123638 5.806 8.40e-05 *** factor(hsd010)2 -0.053653 0.099863 -0.537 0.600902 factor(hsd010)3 -0.541820 0.104759 -5.172 0.000232 *** factor(hsd010)4 -0.981956 0.103744 -9.465 6.46e-07 *** factor(hsd010)5 -1.882124 0.201158 -9.356 7.31e-07 *** ridageyr -0.009443 0.002191 -4.311 0.001013 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for quasibinomial family taken to be 1.219511) Number of Fisher Scoring iterations: 4
In the next example, we will run the logistic regression on a subpopulation (respondents over age 20).
subset1 <- subset(nhc, ridageyr > 20) logit2 <- (svyglm(paq665~factor(hsd010)+ridageyr, family=quasibinomial, design=subset1, na.action = na.omit)) summary(logit2) Call: svyglm(formula = paq665 ~ factor(hsd010) + ridageyr, design = subset1, family = quasibinomial, na.action = na.omit) Survey design: subset(nhc, ridageyr > 20) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.86401 0.17432 4.956 0.000333 *** factor(hsd010)2 -0.10723 0.11507 -0.932 0.369739 factor(hsd010)3 -0.59868 0.11209 -5.341 0.000176 *** factor(hsd010)4 -1.09325 0.11226 -9.738 4.77e-07 *** factor(hsd010)5 -2.04923 0.19386 -10.571 1.96e-07 *** ridageyr -0.01097 0.00292 -3.756 0.002740 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for quasibinomial family taken to be 1.024088) Number of Fisher Scoring iterations: 4
We can also get a Wald test for a variable in the model.
regTermTest(logit2, ~ridageyr)
Wald test for ridageyr
 in svyglm(formula = paq665 ~ factor(hsd010) + ridageyr, design = subset1,
    family = quasibinomial, na.action = na.omit)
F =  14.10816  on  1  and  12  df: p= 0.0027404
Instead of getting an R-squared value as you do in linear regression, a pseudo-R-squared is given in logistic regression. There are many different versions of pseudo-R-squared, and two of them are available with the psrsq function.
psrsq(logit2, method = c("Cox-Snell"))
[1] 0.05148869

psrsq(logit2, method = c("Nagelkerke"))
[1] 0.06873682
Ordered logistic regression
Below is an example of an ordered logistic regression. Note that the outcome variable must be a factor.
ologit1 <- svyolr(factor(dmdeduc2)~factor(female)+factor(dmdborn4)+pad680, design = nhc, method = c("logistic")) summary(ologit1) Call: svyolr(factor(dmdeduc2) ~ factor(female) + factor(dmdborn4) + pad680, design = nhc, method = c("logistic")) Coefficients: Value Std. Error t value factor(female)1 0.097569476 0.0487417745 2.001763 factor(dmdborn4)1 0.709389138 0.1215114128 5.838045 pad680 0.001923994 0.0001494462 12.874160 Intercepts: Value Std. Error t value 1|2 -1.5280 0.1541 -9.9183 2|3 -0.3101 0.1562 -1.9849 3|4 0.8095 0.1344 6.0214 4|5 2.2157 0.1210 18.3118 (4234 observations deleted due to missingness)
Poisson regression
Poisson regression can be run. This is a type of count model (meaning that the outcome variable should be a count).
summary(svyglm(pad675~female, design=nhc, family=poisson()))

Call:
svyglm(formula = pad675 ~ female, design = nhc, family = poisson())

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra,
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2012)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.34112    0.03487 124.491  < 2e-16 ***
female      -0.27148    0.03698  -7.341 1.66e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 66.18153)

Number of Fisher Scoring iterations: 5
Other types of analyses available in the survey package
There are many more types of analyses that are available in the survey package and in other packages that work with complex survey data. Below are a few examples.
The example below shows a principal components analysis (PCA).
pc <- svyprcomp(~pad630+pad675+hsd010, design=nhc,scale=TRUE,scores=TRUE) pc Standard deviations (1, .., p=3): [1] 1.3023573 0.8183481 0.7963491 Rotation (n x k) = (3 x 3): PC1 PC2 PC3 pad630 0.5769018 -0.5906013 -0.5642468 pad675 0.5690424 0.7861726 -0.2410879 hsd010 0.5859822 -0.1819963 0.7896216 biplot(pc, weight="scaled")
This is an example of Cronbach’s alpha.
svycralpha(~hsq571+dmdborn4, design=nhc, na.rm = TRUE)
  *alpha*
0.1339271