How do I analyze survey data with a stratified random sampling with allocation to strata design?

As a statistical programming language, R allows users to access precise statistics instead of simply printing a mass of output to the screen. The examples below highlight how to create a complex sample survey design object and then directly query specific coefficients, error terms, and other survey design-related information as needed.

This page uses the following packages. Make sure that you can load them before trying to run the examples on this page. If you do not have a package installed, run: install.packages("packagename"), or if you see the version is out of date, run: update.packages().

library(foreign)
library(survey)

Version info: Code for this page was tested in R version 3.0.1 (2013-05-16) On: 2013-06-25 With: survey 3.29-5; foreign 0.8-54; knitr 1.2

Example

This example is taken from Levy and Lemeshow’s Sampling of Populations Page 168 stratified random sampling.

Import the Stata dataset directly into R using the read.dta function from the foreign package:

mydata <- 
  read.dta( 
		"https://stats.idre.ucla.edu/stat/examples/sop/jacktwn2.dta" , 
		convert.factors = FALSE 
	)

More detail about read.dta or any other R function can be viewed by typing a question mark in front of the function name in the R console. For example: ?read.dta.

In the R language, individual data sets are stored as data.frame objects, allowing users to load as many tables into working memory as necessary for the analysis. After loading the mydata table into memory, R functions can be run directly on this data table.

# the `class` of the `mydata` object
class(mydata)

## [1] "data.frame"

# the first six rows of the data
head(mydata)

##   stratum race sex vitalsta quartile  npop1  npop nsamp twin nontwin
## 1       1    W  MM       LL     12.5 159471 39868    52    0       1
## 2       1    W  MM       LL     12.5 159471 39868    52    0       1
## 3       1    W  MM       LL     12.5 159471 39868    52    1       0
## 4       1    W  MM       LL     12.5 159471 39868    52    0       1
## 5       1    W  MM       LL     12.5 159471 39868    52    0       1
## 6       1    W  MM       LL     12.5 159471 39868    52    1       0
##   mult unk sampwt sex1 race1 vstatus quart1 fstquart twin1 malemale
## 1    0   0    767    1     1       1      1        1     2        1
## 2    0   0    767    1     1       1      1        1     2        1
## 3    0   0    767    1     1       1      1        1     1        1
## 4    0   0    767    1     1       1      1        1     2        1
## 5    0   0    767    1     1       1      1        1     2        1
## 6    0   0    767    1     1       1      1        1     1        1
##   sexb raceb quartrec
## 1    3     1        3
## 2    3     1        3
## 3    3     1        3
## 4    3     1        3
## 5    3     1        3
## 6    3     1        3

# the last six rows of the data
tail(mydata)

##     stratum race sex vitalsta quartile npop1 npop nsamp twin nontwin
## 826      33    B  FF       LL     87.5   606  152    23    1       0
## 827      33    B  FF       LL     87.5   606  152    23    0       1
## 828      33    B  FF       LL     87.5   606  152    23    0       1
## 829      33    B  FF       LL     87.5   606  152    23    0       1
## 830      33    B  FF       LL     87.5   606  152    23    0       1
## 831      33    B  FF       LL     87.5   606  152    23    0       1
##     mult unk sampwt sex1 race1 vstatus quart1 fstquart twin1 malemale
## 826    0   0   6.59    3     2       1      3        2     1        2
## 827    0   0   6.59    3     2       1      3        2     2        2
## 828    0   0   6.59    3     2       1      3        2     2        2
## 829    0   0   6.59    3     2       1      3        2     2        2
## 830    0   0   6.59    3     2       1      3        2     2        2
## 831    0   0   6.59    3     2       1      3        2     2        2
##     sexb raceb quartrec
## 826    1     2        1
## 827    1     2        1
## 828    1     2        1
## 829    1     2        1
## 830    1     2        1
## 831    1     2        1

# the number of columns
ncol(mydata)

## [1] 23

View a simple tabulation of your variable of interest. This table function accesses the twin column (variable) stored inside the mydata data.frame object.

table(mydata$twin)

## 
##   0   1 
## 698 133

Initiate your svydesign object for a stratified random sampling with allocation to strata design. This `mydesign` object will be used for all subsequent analysis commands:

mydesign <- 
  svydesign( 
		id = ~1 , 
		strata = ~stratum ,
		data = mydata ,		
		weight = ~sampwt ,
		fpc = ~npop
	)

From this point forward, the sampling specifications of the mydata data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object `mydesign` at the design= parameter of the specific R function or method.

View the survey design structure:

mydesign

## Stratified Independent Sampling design
## svydesign(id = ~1, strata = ~stratum, data = mydata, weight = ~sampwt, 
##     fpc = ~npop)

View the survey design’s object class or type:

class(mydesign)

## [1] "survey.design2" "survey.design"

View the weighted total population of this survey design, by referring to the sampwt column from the original data.frame:

sum(mydata$sampwt)

## [1] 256998

View the number of unique strata in this survey design, by referring to the stratum column from the original data.frame:

length(unique(mydata$stratum))

## [1] 18

View the degrees of freedom for this survey design object:

degf(mydesign)

## [1] 813

Count the number of unweighted observations where the variable twin is not missing:

unwtd.count(~twin, mydesign)

##        counts SE
## counts    831  0

Print the mean and standard error of the twin variable:

svymean(~twin, mydesign)

##       mean   SE
## twin 0.101 0.01

Print the weighted total and standard error of the twin variable:

svytotal(~twin, mydesign)

##      total   SE
## twin 26055 3791

Print the weighted total and standard error of the twin variable, broken out by the quart1 variable:

svyby(~twin, ~quart1, mydesign, svytotal)

##   quart1  twin   se
## 1      1 19184 2662
## 2      2  6738 2697
## 3      3   134  127

Print the confidence interval of the previous svyby function, both with and without the df= parameter:

confint(svyby(~twin, ~quart1, mydesign, svytotal))

##   2.5 % 97.5 %
## 1 13967  24400
## 2  1453  12023
## 3  -115    382

# matches stata
confint(svyby(~twin, ~quart1, mydesign, svytotal), df = degf(mydesign))

##   2.5 % 97.5 %
## 1 13959  24408
## 2  1445  12031
## 3  -115    382

Alternatively, the result of a function call like svymean, svytotal, or svyby can be stored into a secondary R object.

mysvymean <- svymean(~twin, mydesign)

Once created, this svymean can be queried independently from the mydesign object. For example, the coef and SE functions can directly extract those attributes:

coef(mysvymean)

##  twin 
## 0.101

SE(mysvymean)

##        twin
## twin 0.0148

We can use the confint function to obtain confidence intervals for the coefficient estimates.

confint(mysvymean)

##       2.5 % 97.5 %
## twin 0.0725   0.13

Note from our unweighted table, the variable twin was binary (composed strictly of zeroes and ones). To produce confidence intervals more accurate for proportions, users might start with the options discussed in ?svyciprop. For example:

svyciprop(~twin, mydesign, method = "logit")

##               2.5% 97.5%
## twin 0.1014 0.0759  0.13

Also note that the number of decimal places shown can be adjusted by modifying the digits parameter within the options function at any time.

options(digits = 10)
SE(mysvymean)

##               twin
## twin 0.01475125727

How do I analyze survey data with a stratified random sampling with allocation to strata design? | R FAQ

Example

See also