How do I analyze survey data with a probability proportional to size sampling design?

As a statistical programming language, R allows users to access precise statistics instead of simply printing a mass of output to the screen. The examples below highlight how to create a complex sample survey design object and then directly query specific coefficients, error terms, and other survey design-related information as needed.

This page uses the following packages. Make sure that you can load them before trying to run the examples on this page. If you do not have a package installed, run: install.packages("packagename"), or if you see the version is out of date, run: update.packages().

library(foreign)
library(survey)

Version info: Code for this page was tested in R version 3.0.1 (2013-05-16) On: 2013-06-25 With: survey 3.29-5; foreign 0.8-54; knitr 1.2

Example 1

This example is taken from Levy and Lemeshow’s Sampling of Populations.

Page 350 cluster sampling with unequal probabilities: probability proportional to size sampling

Import the Stata dataset directly into R using the read.dta function from the foreign package:

mydata <- 
  read.dta( 
		"https://stats.idre.ucla.edu/stat/examples/sop/hospslct.dta" , 
		convert.factors = FALSE 
	)

More detail about read.dta or any other R function can be viewed by typing a question mark in front of the function name in the R console. For example: ?read.dta.

In the R language, individual data sets are stored as data.frame objects, allowing users to load as many tables into working memory as necessary for the analysis. After loading the mydata table into memory, R functions can be run directly on this data table.

# the `class` of the `mydata` object
class(mydata)

## [1] "data.frame"

# the first six rows of the data
head(mydata)

##   drawing hospno admiss lifethrt dxdead    p1      p2        p wstar
## 1       1      2   5036        1      1 0.411 0.00198 0.000816  1001
## 2       1      2   5036        0      0 0.411 0.00198 0.000816  1001
## 3       1      2   5036        0      0 0.411 0.00198 0.000816  1001
## 4       1      2   5036        0      0 0.411 0.00198 0.000816  1001
## 5       1      2   5036        0      0 0.411 0.00198 0.000816  1001
## 6       1      2   5036        0      0 0.411 0.00198 0.000816  1001

# the last six rows of the data
tail(mydata)

##    drawing hospno admiss lifethrt dxdead    p1      p2        p wstar
## 45       5      9   4672        0      0 0.387 0.00214 0.000828  1001
## 46       5      9   4672        0      0 0.387 0.00214 0.000828  1001
## 47       5      9   4672        0      0 0.387 0.00214 0.000828  1001
## 48       5      9   4672        0      0 0.387 0.00214 0.000828  1001
## 49       5      9   4672        0      0 0.387 0.00214 0.000828  1001
## 50       5      9   4672        0      0 0.387 0.00214 0.000828  1001

# the number of columns
ncol(mydata)

## [1] 9

View a simple tabulation of your variable of interest. This table function accesses the lifethrt column (variable) stored inside the mydata data.frame object.

table(mydata$lifethrt)

## 
##  0  1 
## 44  6

Initiate your svydesign object for a probability proportional to size sampling design. This `mydesign` object will be used for all subsequent analysis commands:

mydesign <- 
  svydesign(
		id = ~drawing ,
		data = mydata ,
		weight = ~wstar
	)

From this point forward, the sampling specifications of the mydata data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object `mydesign` at the design= parameter of the specific R function or method.

View the survey design structure:

mydesign

## 1 - level Cluster Sampling design (with replacement)
## With (5) clusters.
## svydesign(id = ~drawing, data = mydata, weight = ~wstar)

View the survey design’s object class or type:

class(mydesign)

## [1] "survey.design2" "survey.design"

View the weighted total population of this survey design, by referring to the wstar column from the original data.frame:

sum(mydata$wstar)

## [1] 50056

View the number of unique PSUs (clusters) in this survey design, by referring to the drawing column from the original data.frame:

length(unique(mydata$drawing))

## [1] 5

View the degrees of freedom for this survey design object:

degf(mydesign)

## [1] 4

Count the number of unweighted observations where the variable lifethrt is not missing:

unwtd.count(~lifethrt, mydesign)

##        counts SE
## counts     50  0

Print the mean and standard error of two variables:

svymean(~lifethrt + dxdead, mydesign)

##          mean   SE
## lifethrt 0.12 0.02
## dxdead   0.04 0.02

Print the weighted total and standard error of the same two variables:

svytotal(~lifethrt + dxdead, mydesign)

##          total   SE
## lifethrt  6007 1001
## dxdead    2002 1226

Alternatively, the result of a function call like svymean or svytotal can be stored into a secondary R object.

mysvymean <- svymean(~lifethrt + dxdead, mydesign, deff = TRUE)

Once created, this svymean can be queried independently from the mydesign object. For example, the coef and SE functions can directly extract those attributes:

coef(mysvymean)

## lifethrt   dxdead 
##     0.12     0.04

SE(mysvymean)

## lifethrt   dxdead 
##   0.0200   0.0245

The design effect extraction function deff can only be used if the original svymean call that created the object mysvymean included the parameter deff = TRUE.

deff(mysvymean)

## lifethrt   dxdead 
##    0.186    0.766

We can use the confint function to obtain confidence intervals for the coefficient estimates.

confint(mysvymean)

##             2.5 % 97.5 %
## lifethrt  0.08080  0.159
## dxdead   -0.00801  0.088

Note from our unweighted table, the variable lifethrt was binary (composed strictly of zeroes and ones). To produce confidence intervals more accurate for proportions, users might start with the options discussed in ?svyciprop. For example:

svyciprop(~lifethrt, mydesign, method = "logit")

##                   2.5% 97.5%
## lifethrt 0.1200 0.0898  0.16

Also note that the number of decimal places shown can be adjusted by modifying the digits parameter within the options function at any time.

options(digits = 10)
SE(mysvymean)

##      lifethrt        dxdead 
## 0.02000000000 0.02449489743

Save the ratio of dxdead to lifethrt into a new object myratio and at the same time print it to the screen by encapsulaing the entire statement in parentheses.

( myratio <- svyratio( ~dxdead , ~lifethrt , mydesign ) )

## Ratio estimator: svyratio.survey.design2(~dxdead, ~lifethrt, mydesign)
## Ratios=
##            lifethrt
## dxdead 0.3333333333
## SEs=
##            lifethrt
## dxdead 0.2324055629

We can then use the confint function to obtain confidence intervals for the ratio.

confint(myratio)

##                         2.5 %       97.5 %
## dxdead/lifethrt -0.1221731998 0.7888398665

We can specify the df= parameter to use the survey design’s degrees of freedom (instead of the default df=Inf) to replicate Stata’s confidence interval calculation method.

# matches stata
confint(myratio, df = degf(mydesign))

##                         2.5 %       97.5 %
## dxdead/lifethrt -0.3119279543 0.9785946209

Example 2

This example is taken from Levy and Lemeshow’s Sampling of Populations.

Page 351 cluster sampling with unequal probabilities: probability proportional to size sampling

NOTE: The following recodes are necessary to create a new weight variable, w2star.

# records where `hospno` is two, create a new variable `tl` with the value 785 throughout.
mydata[ mydata$hospno == 2 , 'tl' ] <- 785

# records where `hospno` is five, set `tl` to 3404
mydata[ mydata$hospno == 5 , 'tl' ] <- 3404

# records where `hospno` is nine, set `tl` to 778
mydata[ mydata$hospno == 9 , 'tl' ] <- 778

# create a new `w2star` variable using `transform`.
mydata <-
  transform(
		mydata ,
		w2star =
			( admiss / 50 ) * ( 7087 / tl )
	)

The variable w2star has been set equal to the product of admiss variable divided by fifty and 7,087 divided by the newly-created tl variable.

Initiate your svydesign object for a probability proportional to size sampling design. This `mydesign2` object will be used for all subsequent analysis commands:

mydesign2 <- 
  svydesign(
  	id = ~drawing ,
		data = mydata ,
		weight = ~w2star
	)

This new survey design object differs from the mydesign object in its weighting by the newly-created w2star variable.

Print the mean and standard error of two variables:

svymean(~lifethrt + dxdead, mydesign2)

##                 mean      SE
## lifethrt 0.121904297 0.02142
## dxdead   0.034287108 0.02300

Print the weighted total and standard error of the same two variables:

svytotal(~lifethrt + dxdead, mydesign2)

##              total        SE
## lifethrt 6259.1758 1277.3220
## dxdead   1760.4715 1079.0434

Print the ratio of dxdead to lifethrt using the new survey design object:

svyratio(~dxdead, ~lifethrt, mydesign2)

## Ratio estimator: svyratio.survey.design2(~dxdead, ~lifethrt, mydesign2)
## Ratios=
##            lifethrt
## dxdead 0.2812625046
## SEs=
##            lifethrt
## dxdead 0.2114834755

How do I analyze survey data with a probability proportional to size sampling design? | R FAQ

Example 1

Example 2

See also