How can I do ratio estimation with survey data?

As a statistical programming language, R allows users to access precise statistics instead of simply printing a mass of output to the screen. The examples below highlight how to create a complex sample survey design object and then directly query specific coefficients, error terms, and other survey design-related information as needed.

This page uses the following packages. Make sure that you can load them before trying to run the examples on this page. If you do not have a package installed, run: install.packages("packagename"), or if you see the version is out of date, run: update.packages().

library(foreign)
library(survey)

Version info: Code for this page was tested in R version 3.0.1 (2013-05-16) On: 2013-06-25 With: survey 3.29-5; foreign 0.8-54; knitr 1.2

The following two examples rely on the svyratio function from the R survey package. More detailed instructions and additional usage examples can be found on the survey package’s ratio estimation page.

Example 1

This example is taken from Levy and Lemeshow’s Sampling of Populations Page 200 ratio estimation under simple random sampling.

Import the Stata dataset directly into R using the read.dta function from the foreign package:

mydata <- 
	read.dta( 
		"https://stats.idre.ucla.edu/stat/books/sop/tab7pt1.dta" , 
		convert.factors = FALSE 
	)

More detail about read.dta or any other R function can be viewed by typing a question mark in front of the function name in the R console. For example: ?read.dta.

In the R language, individual data sets are stored as data.frame objects, allowing users to load as many tables into working memory as necessary for the analysis. After loading the mydata table into memory, R functions can be run directly on this data table.

# the `class` of the `mydata` object
class(mydata)

## [1] "data.frame"

# the first six rows of the data
head(mydata)

##   area pharmexp totmedex totcnt  wt1
## 1    1   100000   300000      8 1.14
## 2    2    50000   200000      8 1.14
## 3    3    75000   300000      8 1.14
## 4    4   200000   600000      8 1.14
## 5    5   150000   450000      8 1.14
## 6    6   175000   520000      8 1.14

# the last six rows of the data
tail(mydata)

##   area pharmexp totmedex totcnt  wt1
## 2    2    50000   200000      8 1.14
## 3    3    75000   300000      8 1.14
## 4    4   200000   600000      8 1.14
## 5    5   150000   450000      8 1.14
## 6    6   175000   520000      8 1.14
## 7    8   150000   450000      8 1.14

# the number of columns
ncol(mydata)

## [1] 5

View simple summaries of your variables of interest. This summary function accesses the pharmexp and totmedex columns (variables) stored inside the mydata data.frame object.

summary(mydata[, c("pharmexp", "totmedex")])

##     pharmexp         totmedex     
##  Min.   : 50000   Min.   :200000  
##  1st Qu.: 87500   1st Qu.:300000  
##  Median :150000   Median :450000  
##  Mean   :128571   Mean   :402857  
##  3rd Qu.:162500   3rd Qu.:485000  
##  Max.   :200000   Max.   :600000

Initiate your svydesign object for a simple random sampling design. This `mydesign` object will be used for all subsequent analysis commands:

mydesign <- 
  svydesign( 
		id = ~1 , 
		data = mydata ,
		weights = ~wt1 ,
		fpc = ~totcnt
	)

From this point forward, the sampling specifications of the mydata data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object `mydesign` at the design= parameter of the specific R function or method.

View the survey design structure:

mydesign

## Independent Sampling design
## svydesign(id = ~1, data = mydata, weights = ~wt1, fpc = ~totcnt)

Count the number of unweighted observations where the variables pharmexp and totmedex are not missing:

unwtd.count(~pharmexp + totmedex, mydesign)

##        counts SE
## counts      7  0

Print the ratio and standard error of these two variables:

svyratio(~pharmexp, ~totmedex, mydesign)

## Ratio estimator: svyratio.survey.design2(~pharmexp, ~totmedex, mydesign)
## Ratios=
##          totmedex
## pharmexp    0.319
## SEs=
##          totmedex
## pharmexp  0.00401

Alternatively, save the ratio of pharmexp to totmedex into a new object myratio and at the same time print it to the screen by encapsulaing the entire statement in parentheses.

(myratio <- svyratio(~pharmexp, ~totmedex, mydesign))

## Ratio estimator: svyratio.survey.design2(~pharmexp, ~totmedex, mydesign)
## Ratios=
##          totmedex
## pharmexp    0.319
## SEs=
##          totmedex
## pharmexp  0.00401

We can then use the confint function to obtain confidence intervals for the ratio.

confint(myratio)

##                   2.5 % 97.5 %
## pharmexp/totmedex 0.311  0.327

We can specify the df= parameter to use the survey design’s degrees of freedom (instead of the default df=Inf) to replicate Stata’s confidence interval calculation method.

# matches stata
confint(myratio, df = degf(mydesign))

##                   2.5 % 97.5 %
## pharmexp/totmedex 0.309  0.329

Example 2

This example is taken from Lehtonen and Pahkinen’s Practical Methods for Design and Analysis of Complex Surveys.

Page 102 Table 3.12

A simple random sample drawn without replacement from the Province’91 population prepared for ratio estimation.

Import the dataset from text directly into R using the read.table function and the text= parameter specifying the entire data set. The syntax n indicates the end of one line of data.

province <-
  read.table( text =
		"id str clu wt ue91 hou85 gwt adjwt smplrat
		1 1 1 4 4123 26881 .5562 2.2248 .25
		2 1 4 4 760 4896 .5562 2.2248 .25
		3 1 5 4 721 3730 .5562 2.2248 .25
		4 1 15 4 142 556 .5562 2.2248 .25
		5 1 18 4 187 1463 .5562 2.2248 .25
		6 1 26 4 331 1946 .5562 2.2248 .25
		7 1 30 4 127 834 .5562 2.2248 .25
		8 1 31 4 219 932 .5562 2.2248 .25" ,
		header = TRUE
	)

Add a column to the province data.frame object. The column (variable) fpc contains only the number 32 for every observation.

province$fpc <- 32

View the entire province data.frame object:

province

##   id str clu wt ue91 hou85   gwt adjwt smplrat fpc
## 1  1   1   1  4 4123 26881 0.556  2.22    0.25  32
## 2  2   1   4  4  760  4896 0.556  2.22    0.25  32
## 3  3   1   5  4  721  3730 0.556  2.22    0.25  32
## 4  4   1  15  4  142   556 0.556  2.22    0.25  32
## 5  5   1  18  4  187  1463 0.556  2.22    0.25  32
## 6  6   1  26  4  331  1946 0.556  2.22    0.25  32
## 7  7   1  30  4  127   834 0.556  2.22    0.25  32
## 8  8   1  31  4  219   932 0.556  2.22    0.25  32

Construct a survey.design object called province.design specifying a simple random sampling design. This `province.design` object will be used for all subsequent analysis commands:

province.design <-
  svydesign(
		id = ~clu ,
		data = province ,
		weights = ~wt ,
		fpc = ~fpc
	)

From this point forward, the sampling specifications of the province data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object province.design at the design= parameter of the specific R function or method.

Print the ratio of ue91 to hou85 into a new object myratio and at the same time print it to the screen by encapsulaing the entire statement in parentheses.

(myratio <- svyratio(~ue91, ~hou85, province.design))

## Ratio estimator: svyratio.survey.design2(~ue91, ~hou85, province.design)
## Ratios=
##      hou85
## ue91  0.16
## SEs=
##        hou85
## ue91 0.00553

We can then use the confint function to obtain confidence intervals for the ratio.

confint(myratio)

##            2.5 % 97.5 %
## ue91/hou85 0.149  0.171

We can specify the df= parameter to use the survey design’s degrees of freedom (instead of the default df=Inf) to replicate Stata’s confidence interval calculation method.

# matches stata
confint(myratio, df = degf(province.design))

##            2.5 % 97.5 %
## ue91/hou85 0.147  0.173

How can I do ratio estimation with survey data? | R FAQ

Example 1

Example 2

See also