How do I analyze survey data with stratification after sampling (poststratification)?

As a statistical programming language, R allows users to access precise statistics instead of simply printing a mass of output to the screen. The examples below highlight how to create a complex sample survey design object and then directly query specific coefficients, error terms, and other survey design-related information as needed.

This page uses the following packages. Make sure that you can load them before trying to run the examples on this page. If you do not have a package installed, run: install.packages("packagename"), or if you see the version is out of date, run: update.packages().

library(foreign)
library(survey)

Version info: Code for this page was tested in R version 3.0.1 (2013-05-16) On: 2013-06-25 With: survey 3.29-5; foreign 0.8-54; knitr 1.2

Example 1

This example is taken from Levy and Lemeshow’s Sampling of Populations, page 174.

Import the Stata dataset directly into R using the read.dta function from the foreign package:

mydata <- 
	read.dta( 
		"https://stats.idre.ucla.edu/stat/data/dogcats.dta" , 
		convert.factors = FALSE 
	)

More detail about read.dta or any other R function can be viewed by typing a question mark in front of the function name in the R console. For example: ?read.dta.

In the R language, individual data sets are stored as data.frame objects, allowing users to load as many tables into working memory as necessary for the analysis. After loading the mydata table into memory, R functions can be run directly on this data table.

# the `class` of the `mydata` object
class(mydata)

## [1] "data.frame"

# the first six rows of the data
head(mydata)

##   id type visits totexp
## 1 14    1      4   40.5
## 2 32    1      5   40.5
## 3 17    1      4   61.8
## 4 38    2      1   27.5
## 5  2    1      5   50.1
## 6  4    1      3   45.8

# the last six rows of the data
tail(mydata)

##    id type visits totexp
## 45  8    2      2   29.9
## 46 10    1      4   42.4
## 47 22    1      2   45.1
## 48 44    1      2   60.5
## 49 50    1      2   39.3
## 50  3    2      2   27.1

# the number of columns
ncol(mydata)

## [1] 4

Add one column to the mydata data.frame object. This column (variable) fpc simply contains the number 1300 for every record.

mydata$fpc <- 1300

In order to confirm this addition was made to the data.frame, re-view the first six rows of the data.

head(mydata)

##   id type visits totexp  fpc
## 1 14    1      4   40.5 1300
## 2 32    1      5   40.5 1300
## 3 17    1      4   61.8 1300
## 4 38    2      1   27.5 1300
## 5  2    1      5   50.1 1300
## 6  4    1      3   45.8 1300

View a simple summary of your variable of interest. This summary function accesses the totexp column (variable) stored inside the mydata data.frame object.

summary(mydata$totexp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     7.1    27.3    41.8    39.7    53.0    66.2

Initiate your preliminary svydesign object in preparation for post-stratification. This design does not yet include the post-stratification weights.

preliminary.design <- 
  svydesign( 
		id = ~1 , 
		data = mydata ,
		fpc = ~fpc
	)

Create a secondary data.frame object with two columns in order to perform the actual post-stratification: 1) The variable used to post-stratify by (type). This column is data set-specific. 2) The post-stratification target weights (Freq). This column should always be called Freq because the R survey package searches for that column name.

ps.weights <-
  data.frame(
		type = c( 1 , 2 ) ,
		Freq = c( 850 , 450 ) 
	)

Re-create your svydesign object for a stratification after sampling design. This `mydesign` object will be used for all subsequent analysis commands:

mydesign <- 
  postStratify(
  	preliminary.design ,
		strata = ~type ,
		population = ps.weights
	)

Note that the postStratify function requires the preliminary.design svydesign object as opposed to the mydata data.frame object, unlike all prior complex sample survey design examples shown.

From this point forward, the sampling specifications of the mydata data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object `mydesign` at the design= parameter of the specific R function or method.

View the survey design’s object class or type:

class(mydesign)

## [1] "survey.design2" "survey.design"

View the weighted total population of this survey design, by referring to the Freq column from the post-stratification specification data.frame:

sum(ps.weights$Freq)

## [1] 1300

View the number of unique poststrata in this survey design, by referring to the type column from the post-stratification target weights data.frame:

length(unique(ps.weights$type))

## [1] 2

View the degrees of freedom for this survey design object:

degf(mydesign)

## [1] 49

Count the number of unweighted observations where the variable totexp is not missing:

unwtd.count(~totexp, mydesign)

##        counts SE
## counts     50  0

Print the mean and standard error of this variable:

svymean(~totexp, mydesign)

##        mean   SE
## totexp 40.1 1.16

Print the weighted total and standard error of this variable:

svytotal(~totexp, mydesign)

##        total   SE
## totexp 52150 1513

Print the mean and standard error of the totexp variable, broken out by the type variable:

svyby(~totexp, ~type, mydesign, svymean)

##   type totexp   se
## 1    1   49.9 1.44
## 2    2   21.7 1.97

Alternatively, the result of a function call like svymean, svytotal, or svyby can be stored into a secondary R object.

mysvyby <- svyby(~totexp, ~type, mydesign, svytotal)

Once created, this svyby can be queried independently from the mydesign object. For example, the coef and SE functions can directly extract those attributes:

coef(mysvyby)

##     1     2 
## 42380  9770

SE(mysvyby)

## [1] 1227  884

We can use the confint function to obtain confidence intervals for the coefficient estimates.

confint(mysvyby)

##   2.5 % 97.5 %
## 1 39975  44785
## 2  8037  11503

We can specify the df= parameter to use the survey design’s degrees of freedom (instead of the default df=Inf) to replicate Stata’s confidence interval calculation method.

# matches stata
confint(mysvyby, df = degf(mydesign))

##   2.5 % 97.5 %
## 1 39914  44846
## 2  7993  11547

Also note that the number of decimal places shown can be adjusted by modifying the digits parameter within the options function at any time.

options(digits = 10)
confint(mysvyby, df = degf(mydesign))

##          2.5 %      97.5 %
## 1 39913.649214 44845.69454
## 2  7992.988243 11547.01176

Example 2

This example is taken from Lehtonen and Pahkinen’s Practical Methods for Design and Analysis of Complex Surveys.

Page 97 Table 3.10 A simple random sample drawn without replacement from the Province’91 population with poststratum weights.

Import the dataset from text directly into R using the read.table function and the text= parameter specifying the entire data set. The syntax n indicates the end of one line of data.

province <-
  read.table( text =
		"id str clu wt ue91 lab91 poststr gwt postwt sruv srcvs
		  1 1 1 4 4123 33786 1 .5833 2.333 .25 .43
		  2 1 4 4 760 5919 1 .5833 2.333 .25 .43
		  3 1 5 4 721 4930 1 .5833 2.333 .25 .43
		  4 1 15 4 142 675 2 1.2500 5.0000 .25 .20
		  5 1 18 4 187 1448 2 1.2500 5.0000 .25 .20
		  6 1 26 4 331 2543 2 1.2500 5.0000 .25 .20
		  7 1 30 4 127 1084 2 1.2500 5.0000 .25 .20
		  8 1 31 4 219 1330 2 1.2500 5.0000 .25 .20" ,
		header = TRUE
	)

View the entire province data.frame object:

province

##   id str clu wt ue91 lab91 poststr    gwt postwt sruv srcvs
## 1  1   1   1  4 4123 33786       1 0.5833  2.333 0.25  0.43
## 2  2   1   4  4  760  5919       1 0.5833  2.333 0.25  0.43
## 3  3   1   5  4  721  4930       1 0.5833  2.333 0.25  0.43
## 4  4   1  15  4  142   675       2 1.2500  5.000 0.25  0.20
## 5  5   1  18  4  187  1448       2 1.2500  5.000 0.25  0.20
## 6  6   1  26  4  331  2543       2 1.2500  5.000 0.25  0.20
## 7  7   1  30  4  127  1084       2 1.2500  5.000 0.25  0.20
## 8  8   1  31  4  219  1330       2 1.2500  5.000 0.25  0.20

Add one column to the province data.frame object. This column (variable) fpc simply contains the number 32 for every record.

province$fpc <- 32

Initiate your preliminary svydesign object in preparation for post-stratification. This design does not yet include the post-stratification weights.

province.prelim.design <- 
  svydesign( 
  	id = ~1 , 
		data = province ,
    weights = ~wt ,
		fpc = ~fpc
	)

Create a secondary data.frame object with two columns in order to perform the actual post-stratification: 1) The variable used to post-stratify by (poststr). This column is data set-specific. 2) The post-stratification target weights (Freq). This column should always be called Freq because the R survey package searches for that column name.

province.ps.weights <-
  data.frame(
		poststr = c( 1 , 2 ) ,
		Freq = c( 7 , 25 ) 
	)

Re-create your svydesign object for a stratification after sampling design. This `province.design` object will be used for all subsequent analysis commands:

province.design <- 
  postStratify(
  	province.prelim.design ,
		strata = ~poststr ,
		population = province.ps.weights
	)

Note that the postStratify function requires the preliminary.design svydesign object as opposed to the mydata data.frame object, unlike all prior complex sample survey design examples shown.

From this point forward, the sampling specifications of the province data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object `province.design` at the design= parameter of the specific R function or method.

View the weighted total population of this survey design, by referring to the Freq column from the post-stratification specification data.frame:

sum(province.ps.weights$Freq)

## [1] 32

Count the number of unweighted observations where the variable ue91 is not missing:

unwtd.count(~ue91, province.design)

##        counts SE
## counts      8  0

Print the mean and standard error of the ue91 variable:

svymean(~ue91, province.design)

##          mean        SE
## ue91 565.8125 187.92645

Print the weighted total and standard error of the ue91 variable:

svytotal(~ue91, province.design)

##      total        SE
## ue91 18106 6013.6462

Save the ratio of ue91 to lab91 into a new object myratio and at the same time print it to the screen by encapsulaing the entire statement in parentheses.

(myratio <- svyratio(~ue91, ~lab91, province.design))

## Ratio estimator: svyratio.survey.design2(~ue91, ~lab91, province.design)
## Ratios=
##             lab91
## ue91 0.1297471605
## SEs=
##               lab91
## ue91 0.004385981284

We can then use the confint function to obtain confidence intervals for the ratio.

confint(myratio)

##                   2.5 %       97.5 %
## ue91/lab91 0.1211507951 0.1383435258

We can specify the df= parameter to use the survey design’s degrees of freedom (instead of the default df=Inf) to replicate Stata’s confidence interval calculation method.

# matches stata
confint(myratio, df = degf(province.design))

##                   2.5 %       97.5 %
## ue91/lab91 0.1193759628 0.1401183582

Print the median of the ue91 variable, including the confidence interval in the output.

svyquantile(~ue91, province.design, quantiles = 0.5, ci = TRUE)

## $quantiles
##        0.5
## ue91 193.4
## 
## $CIs
## , , ue91
## 
##                0.5
## (lower 130.1835058
## upper) 487.9189632

How do I analyze survey data with stratification after sampling (poststratification)? | R FAQ

Example 1

Example 2

See also