As a statistical programming language, R allows users to access precise statistics instead of simply printing a mass of output to the screen. The examples below highlight how to create a complex sample survey design object and then directly query specific coefficients, error terms, and other survey design-related information as needed.
This page uses the following packages. Make sure that you can load
them before trying to run the examples on this page. If you do not have
a package installed, run: install.packages("packagename"), or
if you see the version is out of date, run: update.packages().
library(foreign) library(survey)
Version info: Code for this page was tested in R version 3.0.1 (2013-05-16)
On: 2013-06-25
With: survey 3.29-5; foreign 0.8-54; knitr 1.2
Example 1
This example is taken from Levy and Lemeshow’s Sampling of Populations page 247 simple one-stage cluster sampling.
Import the Stata dataset directly into R using the read.dta function from the foreign package:
mydata <- read.dta( "https://stats.idre.ucla.edu/stat/books/sop/tab9_1a.dta" , convert.factors = FALSE )
More detail about read.dta or any other R function can be viewed by typing a question mark in front of the function name in the R console. For example: ?read.dta.
In the R language, individual data sets are stored as data.frame objects, allowing users to load as many tables into working memory as necessary for the analysis. After loading the mydata table into memory, R functions can be run directly on this data table.
# the `class` of the `mydata` object class(mydata)
## [1] "data.frame"
# the first six rows of the data head(mydata)
## devlpmnt HH NGE65 NVSTNRS wt1 M N nge65dv hhneedvn prsndvn ## 1 2 1 2 1 2.5 5 20 40 1 1 ## 2 2 2 1 0 2.5 5 20 20 0 0 ## 3 2 3 2 0 2.5 5 20 40 0 0 ## 4 2 4 1 1 2.5 5 20 20 1 4 ## 5 2 5 1 0 2.5 5 20 20 0 0 ## 6 2 6 1 0 2.5 5 20 20 0 0
# the last six rows of the data tail(mydata)
## devlpmnt HH NGE65 NVSTNRS wt1 M N nge65dv hhneedvn prsndvn ## 35 5 15 2 1 2.5 5 20 40 1 15 ## 36 5 16 2 1 2.5 5 20 40 1 16 ## 37 5 17 1 0 2.5 5 20 20 0 0 ## 38 5 18 2 0 2.5 5 20 40 0 0 ## 39 5 19 2 1 2.5 5 20 40 1 19 ## 40 5 20 1 1 2.5 5 20 20 1 20
# the number of columns ncol(mydata)
## [1] 10
View a simple tabulation of your variable of interest. This table function accesses the hhneedvn column (variable) stored inside the mydata data.frame object.
table(mydata$hhneedvn)
## ## 0 1 ## 19 21
View a simple summary of your variable of interest. This summary function accesses the NVSTNRS column (variable) stored inside the mydata data.frame object.
summary(mydata$NVSTNRS)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.000 0.000 1.000 0.575 1.000 2.000
Initiate your svydesign object for a one-stage cluster design. This `mydesign` object will be used for all subsequent analysis commands:
mydesign <-
svydesign(
ids = ~devlpmnt ,
data = mydata ,
weights = ~wt1 ,
fpc = ~M
)
From this point forward, the sampling specifications of the mydata data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object `mydesign` at the design= parameter of the specific R function or method.
View the survey design structure:
mydesign
## 1 - level Cluster Sampling design ## With (2) clusters. ## svydesign(ids = ~devlpmnt, data = mydata, weights = ~wt1, fpc = ~M)
View the survey design’s object class or type:
class(mydesign)
## [1] "survey.design2" "survey.design"
View the weighted total population of this survey design, by referring to the wt1 column from the original data.frame:
sum(mydata$wt1)
## [1] 100
View the number of unique PSUs (clusters) in this survey design, by referring to the devlpmnt column from the original data.frame:
length(unique(mydata$devlpmnt))
## [1] 2
View the degrees of freedom for this survey design object:
degf(mydesign)
## [1] 1
Count the number of unweighted observations where the variable NVSTNRS is not missing:
unwtd.count(~NVSTNRS, mydesign)
## counts SE ## counts 40 0
Print the mean and standard error of three variables:
svymean(~NVSTNRS + NGE65 + hhneedvn, mydesign)
## mean SE ## NVSTNRS 0.575 0.02 ## NGE65 1.675 0.02 ## hhneedvn 0.525 0.02
Print the weighted total and standard error of the same three variables:
svytotal(~NVSTNRS + NGE65 + hhneedvn, mydesign)
## total SE ## NVSTNRS 57.5 1.94 ## NGE65 167.5 1.94 ## hhneedvn 52.5 1.94
Alternatively, the result of a function call like svymean or svytotal can be stored into a secondary R object.
mysvymean <- svymean(~NVSTNRS + NGE65 + hhneedvn, mydesign, deff = TRUE)
Once created, this svymean can be queried independently from the mydesign object. For example, the coef and SE functions can directly extract those attributes:
coef(mysvymean)
## NVSTNRS NGE65 hhneedvn ## 0.575 1.675 0.525
SE(mysvymean)
## NVSTNRS NGE65 hhneedvn ## 0.0194 0.0194 0.0194
The design effect extraction function deff can only be used if the original svymean call that created the object mysvymean included the parameter deff = TRUE.
deff(mysvymean)
## NVSTNRS NGE65 hhneedvn ## 0.0708 0.0394 0.0977
We can use the confint function to obtain confidence intervals for the coefficient estimates.
confint(mysvymean)
## 2.5 % 97.5 % ## NVSTNRS 0.537 0.613 ## NGE65 1.637 1.713 ## hhneedvn 0.487 0.563
Note from our unweighted table, the variable hhneedvn was binary (composed strictly of zeroes and ones). To produce confidence intervals more accurate for proportions, users might start with the options discussed in ?svyciprop. For example:
svyciprop(~hhneedvn, mydesign, method = "logit")
## 2.5% 97.5% ## hhneedvn 0.525 0.499 0.55
Also note that the number of decimal places shown can be adjusted by modifying the digits parameter within the options function at any time.
options(digits = 10) SE(mysvymean)
## NVSTNRS NGE65 hhneedvn ## 0.01936491673 0.01936491673 0.01936491673
Save the ratio of NVSTNRS to NGE65 into a new object myratio and at the same time print it to the screen by encapsulaing the entire statement in parentheses.
(myratio <- svyratio(~NVSTNRS, ~NGE65, mydesign))
## Ratio estimator: svyratio.survey.design2(~NVSTNRS, ~NGE65, mydesign) ## Ratios= ## NGE65 ## NVSTNRS 0.3432835821 ## SEs= ## NGE65 ## NVSTNRS 0.007592393283
We can then use the confint function to obtain confidence intervals for the ratio.
confint(myratio)
## 2.5 % 97.5 % ## NVSTNRS/NGE65 0.3284027647 0.3581643995
We can specify the df= parameter to use the survey design’s degrees of freedom (instead of the default df=Inf) to replicate Stata’s confidence interval calculation method.
# matches stata confint(myratio, df = degf(mydesign))
## 2.5 % 97.5 % ## NVSTNRS/NGE65 0.2468130786 0.4397540856
Example 2
This example is taken from Lehtonen and Pahkinen’s Practical Methods for Design and Analysis of Complex Surveys. page 83 Table 3.6
Estimates from a one-stage CLU sample (n = 8); the Province’91 population.
Import the dataset from text directly into R using the read.table function and the text= parameter specifying the entire data set. The syntax n indicates the end of one line of data.
province <-
read.table( text =
"id str clu wt ue91 lab91
1 1 2 4 666 6016
2 1 2 4 528 3818
3 1 2 4 760 5919
4 1 2 4 187 1448
5 1 8 4 129 927
6 1 8 4 128 819
7 1 8 4 331 2543
8 1 8 4 568 4011" ,
header = TRUE
)
View the entire province data.frame object:
province
## id str clu wt ue91 lab91 ## 1 1 1 2 4 666 6016 ## 2 2 1 2 4 528 3818 ## 3 3 1 2 4 760 5919 ## 4 4 1 2 4 187 1448 ## 5 5 1 8 4 129 927 ## 6 6 1 8 4 128 819 ## 7 7 1 8 4 331 2543 ## 8 8 1 8 4 568 4011
Construct a survey.design object called province.design specifying a one-stage cluster design. This `province.design` object will be used for all subsequent analysis commands:
province.design <-
svydesign(
id = ~clu ,
strata = ~str ,
data = province ,
weight = ~wt
)
From this point forward, the sampling specifications of the province data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object province.design at the design= parameter of the specific R function or method.
View the survey design structure:
province.design
## Stratified 1 - level Cluster Sampling design (with replacement) ## With (2) clusters. ## svydesign(id = ~clu, strata = ~str, data = province, weight = ~wt)
View the weighted total population of this survey design, by referring to the wt column from the original data.frame:
sum(province$wt)
## [1] 32
Count the number of unweighted observations where the variable ue91 is not missing:
unwtd.count(~ue91, province.design)
## counts SE ## counts 8 0
Print the mean and standard error of the ue91 variable:
svymean(~ue91, province.design)
## mean SE ## ue91 412.125 123.125
Print the weighted total and standard error of the ue91 variable:
svytotal(~ue91, province.design)
## total SE ## ue91 13188 3940
Save the ratio of ue91 to lab91 into a new object myratio and at the same time print it to the screen by encapsulaing the entire statement in parentheses.
(myratio <- svyratio(~ue91, ~lab91, province.design))
## Ratio estimator: svyratio.survey.design2(~ue91, ~lab91, province.design) ## Ratios= ## lab91 ## ue91 0.1292890475 ## SEs= ## lab91 ## ue91 0.006501776859
We can then use the confint function to obtain confidence intervals for the ratio.
confint(myratio)
## 2.5 % 97.5 % ## ue91/lab91 0.116545799 0.142032296
We can specify the df= parameter to use the survey design’s degrees of freedom (instead of the default df=Inf) to replicate Stata’s confidence interval calculation method.
# matches stata confint(myratio, df = degf(province.design))
## 2.5 % 97.5 % ## ue91/lab91 0.04667613956 0.2119019554
Print the median of the ue91 variable, including the confidence interval in the output.
svyquantile(~ue91, province.design, quantiles = 0.5, ci = TRUE)
## $quantiles ## 0.5 ## ue91 331 ## ## $CIs ## , , ue91 ## ## 0.5 ## (lower 128.0000000 ## upper) 752.4732291
See also
- The R survey package homepage
- Lumley, T. Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
- Damico, A. Step-by-step instructions to analyze major public-use survey data sets with the R language
