How do I analyze survey data with a stratified random sampling design?

As a statistical programming language, R allows users to access precise statistics instead of simply printing a mass of output to the screen. The examples below highlight how to create a complex sample survey design object and then directly query specific coefficients, error terms, and other survey design-related information as needed.

This page uses the following packages. Make sure that you can load them before trying to run the examples on this page. If you do not have a package installed, run: install.packages("packagename"), or if you see the version is out of date, run: update.packages().

library(foreign)
library(survey)

Version info: Code for this page was tested in R version 3.0.1 (2013-05-16) On: 2013-06-25 With: survey 3.29-5; foreign 0.8-54; knitr 1.2

Example 1

This example is taken from Levy and Lemeshow’s Sampling of Populations.

Page 136 stratified random sampling

Import the Stata dataset directly into R using the read.dta function from the foreign package:

mydata <- 
  read.dta( 
		"https://stats.idre.ucla.edu/stat/books/sop/hospsamp.dta" , 
		convert.factors = FALSE 
	)

More detail about read.dta or any other R function can be viewed by typing a question mark in front of the function name in the R console. For example: ?read.dta.

In the R language, individual data sets are stored as data.frame objects, allowing users to load as many tables into working memory as necessary for the analysis. After loading the mydata table into memory, R functions can be run directly on this data table.

# the `class` of the `mydata` object
class(mydata)

## [1] "data.frame"

# the first six rows of the data
head(mydata)

##   hospno oblevel weighta tothosp births
## 1     15       1    10.5      42    480
## 2     80       1    10.5      42    426
## 3     86       1    10.5      42    342
## 4    136       1    10.5      42    174
## 5      7       2    19.8      99   2022
## 6     26       2    19.8      99    576

# the last six rows of the data
tail(mydata)

##    hospno oblevel weighta tothosp births
## 10     28       3    2.83      17   3108
## 11     34       3    2.83      17   4674
## 12     39       3    2.83      17   2539
## 13    102       3    2.83      17   1610
## 14    119       3    2.83      17   4618
## 15    149       3    2.83      17   1781

# the number of columns
ncol(mydata)

## [1] 5

View a simple summary of your variable of interest. This summary function accesses the births column (variable) stored inside the mydata data.frame object.

summary(mydata$births)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     174     481    1610    1710    2280    4670

Initiate your svydesign object for a stratified random sampling design. This `mydesign` object will be used for all subsequent analysis commands:

mydesign <- 
  svydesign(
		id = ~1 ,
    strata = ~oblevel ,
		data = mydata ,
		weight = ~weighta ,
    fpc = ~tothosp
	)

From this point forward, the sampling specifications of the mydata data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object `mydesign` at the design= parameter of the specific R function or method.

View the survey design structure:

mydesign

## Stratified Independent Sampling design
## svydesign(id = ~1, strata = ~oblevel, data = mydata, weight = ~weighta, 
##     fpc = ~tothosp)

View the survey design’s object class or type:

class(mydesign)

## [1] "survey.design2" "survey.design"

View the weighted total population of this survey design, by referring to the weighta column from the original data.frame:

sum(mydata$weighta)

## [1] 158

View the number of unique strata in this survey design, by referring to the oblevel column from the original data.frame:

length(unique(mydata$oblevel))

## [1] 3

View the degrees of freedom for this survey design object:

degf(mydesign)

## [1] 12

Count the number of unweighted observations where the variable births is not missing:

unwtd.count(~births, mydesign)

##        counts SE
## counts     15  0

Print the mean and standard error of the births variable:

svymean(~births, mydesign)

##        mean  SE
## births 1164 215

Print the weighted total and standard error of the births variable:

svytotal(~births, mydesign)

##         total    SE
## births 183983 34014

Print the weighted total and standard error of the births variable, broken down by the oblevel variable:

svyby(~births, ~oblevel, mydesign, svytotal)

##   oblevel births    se
## 1       1  14931  2670
## 2       2 117117 33068
## 3       3  51935  7508

Alternatively, the result of a function call like svymean or svytotal can be stored into a secondary R object.

mysvymean <- svymean(~births, mydesign, deff = TRUE)

Once created, this svymean can be queried independently from the mydesign object. For example, the coef and SE functions can directly extract those attributes:

coef(mysvymean)

## births 
##   1164

SE(mysvymean)

##        births
## births    215

The design effect extraction function deff can only be used if the original svymean call that created the object mysvymean included the parameter deff = TRUE.

deff(mysvymean)

## births 
##  0.704

We can use the confint function to obtain confidence intervals for the coefficient estimates.

confint(mysvymean)

##        2.5 % 97.5 %
## births   743   1586

Note that the number of decimal places shown can be adjusted by modifying the digits parameter within the options function at any time.

options(digits = 10)
SE(mysvymean)

##             births
## births 215.2806588

Example 2

This example is taken from Lehtonen and Pahkinen’s Practical Methods for Design and Analysis of Complex Surveys.

Page 74 Table 3.3 Estimates from an optimally allocated stratified simple random sample (n = 8); the Province’91 population.

NOTE: In this data set, the fpc changes with the strata. This is different from all of the previous examples.

Import the dataset from text directly into R using the read.table function and the text= parameter specifying the entire data set. The syntax n indicates the end of one line of data.

province <-
  read.table( text =
		"id str clu wt ue91 lab91 fpc
		1 1 1 1.75 4123 33786 7
		2 1 2 1.75 666 6016 7
		3 1 4 1.75 760 5919 7
		4 1 6 1.75 457 3022 7
		5 2 21 6.25 61 573 25
		6 2 25 6.25 262 1737 25
		7 2 26 6.25 331 2543 25
		8 2 27 6.25 98 545 25" ,
		header = TRUE
	)

View the entire province data.frame object:

province

##   id str clu   wt ue91 lab91 fpc
## 1  1   1   1 1.75 4123 33786   7
## 2  2   1   2 1.75  666  6016   7
## 3  3   1   4 1.75  760  5919   7
## 4  4   1   6 1.75  457  3022   7
## 5  5   2  21 6.25   61   573  25
## 6  6   2  25 6.25  262  1737  25
## 7  7   2  26 6.25  331  2543  25
## 8  8   2  27 6.25   98   545  25

Construct a survey.design object called province.design specifying a stratified random sampling design. This `province.design` object will be used for all subsequent analysis commands:

province.design <-
  svydesign(
		id = ~clu ,
		strata = ~str ,
		data = province ,
		weight = ~wt ,
		fpc = ~fpc
	)

From this point forward, the sampling specifications of the province data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object province.design at the design= parameter of the specific R function or method.

View the survey design structure:

province.design

## Stratified Independent Sampling design
## svydesign(id = ~clu, strata = ~str, data = province, weight = ~wt, 
##     fpc = ~fpc)

View the weighted total population of this survey design, by referring to the wt column from the original data.frame:

sum(province$wt)

## [1] 32

Count the number of unweighted observations where the variable ue91 is not missing:

unwtd.count(~ue91, province.design)

##        counts SE
## counts      8  0

Print the mean and standard error of the ue91 variable:

svymean(~ue91, province.design)

##           mean        SE
## ue91 475.32812 133.73286

Print the weighted total and standard error of the ue91 variable:

svytotal(~ue91, province.design)

##        total        SE
## ue91 15210.5 4279.4516

Save the ratio of ue91 to lab91 into a new object myratio and at the same time print it to the screen by encapsulaing the entire statement in parentheses.

(myratio <- svyratio(~ue91, ~lab91, province.design))

## Ratio estimator: svyratio.survey.design2(~ue91, ~lab91, province.design)
## Ratios=
##             lab91
## ue91 0.1277787929
## SEs=
##               lab91
## ue91 0.003173563794

We can then use the confint function to obtain confidence intervals for the ratio.

confint(myratio)

##                   2.5 %       97.5 %
## ue91/lab91 0.1215587221 0.1339988636

We can specify the df= parameter to use the survey design’s degrees of freedom (instead of the default df=Inf) to replicate Stata’s confidence interval calculation method.

# matches stata
confint(myratio, df = degf(province.design))

##                  2.5 %       97.5 %
## ue91/lab91 0.120013362 0.1355442237

Print the median of the ue91 variable, including the confidence interval in the output.

svyquantile(~ue91, province.design, quantiles = 0.5, ci = TRUE)

## $quantiles
##         0.5
## ue91 189.84
## 
## $CIs
## , , ue91
## 
##                0.5
## (lower  61.0000000
## upper) 690.9203778

How do I analyze survey data with a stratified random sampling design? | R FAQ

Example 1

Example 2

See also