How do I analyze survey data with a stratified design with certainty PSUs?

As a statistical programming language, R allows users to access precise statistics instead of simply printing a mass of output to the screen. The examples below highlight how to create a complex sample survey design object and then directly query specific coefficients, error terms, and other survey design-related information as needed.

This page uses the survey package. Make sure that you can load it before trying to run the examples on this page. If you do not have a package installed, run: install.packages("packagename"), or if you see the version is out of date, run: update.packages().

library(survey)

Version info: Code for this page was tested in R version 3.0.1 (2013-05-16) On: 2013-06-25 With: survey 3.29-5; knitr 1.2

Example

This example is taken from Lehtonen and Pahkinen’s Practical Methods for Design and Analysis of Complex Surveys. Page 60 Table 2.8 Estimates under a PPSSYS design (n = 8); the Province’91 population.

NOTE: The certainty PSU (the first line of the data) was entered twice and the weight was changed from 1 to .5 for each observation. This is necessary because you need to have two observations in each strata.

Import the dataset from text directly into R using the read.table function and the text= parameter specifying the entire data set. The syntax n indicates the end of one line of data.

mydata <-
  read.table( text =
		"id str clu wt hou85 ue91 lab91
		1 2 1  0.5    26881 4123 33786
		2 2 1  0.5    26881 4123 33786
		3 1 10 1.004  9230  1623 13727
		4 1 4  1.893  4896  760  5919
		5 1 7  2.173  4264  767  5823
		6 1 32 2.971  3119  568  4011
		7 1 26 4.762  1946  331  2543
		8 1 18 6.335  1463  187  1448
		9 1 13 13.730 675   129  927" ,
		header = TRUE
	)

More detail about read.table or any other R function can be viewed by typing a question mark in front of the function name in the R console. For example: ?read.table.

In the R language, individual data sets are stored as data.frame objects, allowing users to load as many tables into working memory as necessary for the analysis. After loading the mydata table into memory, R functions can be run directly on this data table.

# the `class` of the `mydata` object
class(mydata)

## [1] "data.frame"

# the first six rows of the data
head(mydata)

##   id str clu   wt hou85 ue91 lab91
## 1  1   2   1 0.50 26881 4123 33786
## 2  2   2   1 0.50 26881 4123 33786
## 3  3   1  10 1.00  9230 1623 13727
## 4  4   1   4 1.89  4896  760  5919
## 5  5   1   7 2.17  4264  767  5823
## 6  6   1  32 2.97  3119  568  4011

# the last six rows of the data
tail(mydata)

##   id str clu    wt hou85 ue91 lab91
## 4  4   1   4  1.89  4896  760  5919
## 5  5   1   7  2.17  4264  767  5823
## 6  6   1  32  2.97  3119  568  4011
## 7  7   1  26  4.76  1946  331  2543
## 8  8   1  18  6.33  1463  187  1448
## 9  9   1  13 13.73   675  129   927

# the number of columns
ncol(mydata)

## [1] 7

View a simple summary of your variable of interest. This summary function accesses the ue91 column (variable) stored inside the mydata data.frame object.

summary(mydata$ue91)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     129     331     760    1400    1620    4120

Initiate your svydesign object for a stratified design with certainty PSUs. This `mydesign` object will be used for all subsequent analysis commands:

mydesign <- 
  svydesign(
		id = ~clu ,
		data = mydata ,
		weight = ~wt ,
		strata = ~str
	)

From this point forward, the sampling specifications of the mydata data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object `mydesign` at the design= parameter of the specific R function or method.

View the survey design structure:

mydesign

## Stratified 1 - level Cluster Sampling design (with replacement)
## With (8) clusters.
## svydesign(id = ~clu, data = mydata, weight = ~wt, strata = ~str)

View the survey design’s object class or type:

class(mydesign)

## [1] "survey.design2" "survey.design"

View the weighted total population of this survey design, by referring to the wt column from the original data.frame:

sum(mydata$wt)

## [1] 33.9

View the number of unique PSUs (clusters) in this survey design, by referring to the clu column from the original data.frame:

length(unique(mydata$clu))

## [1] 8

View the number of unique strata in this survey design, by referring to the str column from the original data.frame:

length(unique(mydata$str))

## [1] 2

View the degrees of freedom for this survey design object:

degf(mydesign)

## [1] 6

Count the number of unweighted observations where the variable ue91 is not missing:

unwtd.count(~ue91, mydesign)

##        counts SE
## counts      9  0

Attempt (and fail) to print the mean and standard error of the ue91 variable:

svymean(~ue91, mydesign)

## Error: Stratum (2) has only one PSU at stage 1

This command returns an error because the current survey design object contains a stratum with only one sampled PSU. This issue is discussed in-depth on the Lonely PSU page of the R survey package. When this situation occurs, the software defaults to an error in order to capture the user’s attention and force a choice about how the software should treat these observations when calculating the design-adjusted variance. This behavior can be specified by modifying the survey.lonely.psu parameter within the options function at any time.

The most conservative approach would be to center any single-PSU strata around the sample grand mean rather than the stratum mean:

options(survey.lonely.psu = "adjust")

svymean(~ue91, mydesign)

##      mean  SE
## ue91  445 186

However, most proprietary statistical software packages have single-PSU strata make no contribution to the variance by default:

options(survey.lonely.psu = "certainty")

svymean(~ue91, mydesign)

##      mean  SE
## ue91  445 150

Once the survey.lonely.psu option has been specified, that behavior will be maintained for the duration of the R session.

Print the weighted total and standard error of the ue91 variable:

svytotal(~ue91, mydesign)

##      total  SE
## ue91 15077 521

Alternatively, the result of a function call like svymean or svytotal can be stored into a secondary R object.

mysvymean <- svymean(~ue91, mydesign, deff = TRUE)

Once created, this svymean can be queried independently from the mydesign object. For example, the coef and SE functions can directly extract those attributes:

coef(mysvymean)

## ue91 
##  445

SE(mysvymean)

##      ue91
## ue91  150

The design effect extraction function deff can only be used if the original svymean call that created the object mysvymean included the parameter deff = TRUE.

deff(mysvymean)

##  ue91 
## 0.485

We can use the confint function to obtain confidence intervals for the coefficient estimates.

confint(mysvymean)

##      2.5 % 97.5 %
## ue91   150    740

Save the ratio of ue91 to lab91 into a new object myratio and at the same time print it to the screen by encapsulaing the entire statement in parentheses.

(myratio <- svyratio(~ue91, ~lab91, mydesign))

## Ratio estimator: svyratio.survey.design2(~ue91, ~lab91, mydesign)
## Ratios=
##      lab91
## ue91 0.128
## SEs=
##        lab91
## ue91 0.00222

We can then use the confint function to obtain confidence intervals for the ratio.

confint(myratio)

##            2.5 % 97.5 %
## ue91/lab91 0.124  0.133

We can specify the df= parameter to use the survey design’s degrees of freedom (instead of the default df=Inf) to replicate Stata’s confidence interval calculation method.

# matches stata
confint(myratio, df = degf(mydesign))

##            2.5 % 97.5 %
## ue91/lab91 0.123  0.134

Print the median of the ue91 variable, including the confidence interval in the output.

svyquantile(~ue91, mydesign, quantiles = 0.5, ci = TRUE)

## $quantiles
##      0.5
## ue91 158
## 
## $CIs
## , , ue91
## 
##         0.5
## (lower  129
## upper) 4123

How do I analyze survey data with a stratified design with certainty PSUs? | R FAQ

Example

See also