As a statistical programming language, R allows users to access precise statistics instead of simply printing a mass of output to the screen. The examples below highlight how to create a complex sample survey design object and then directly query specific coefficients, error terms, and other survey design-related information as needed.
This page uses the survey
package. Make sure that you can load
it before trying to run the examples on this page. If you do not have
a package installed, run: install.packages("packagename")
, or
if you see the version is out of date, run: update.packages()
.
library(survey)
Version info: Code for this page was tested in R version 3.0.1 (2013-05-16)
On: 2013-06-25
With: survey 3.29-5; knitr 1.2
The following example relies on the svyglm
function from the R survey
package. More detailed instructions and additional usage examples can be found on the survey
package’s survey-weighted generalized linear models page.
Example
This example is taken from Lehtonen and Pahkinen’s Practical Methods for Design and Analysis of Complex Surveys.
Page 107 Table 3.14
Model-assisted estimation results for the population total of ue91 sample of eight elements drawn from the Province’91 population.
Import the dataset from text directly into R using the read.table
function and the text=
parameter specifying the entire data set. The syntax n
indicates the end of one line of data.
province <-
read.table( text =
"id str clu wt ue91 meanz hou85 diffhou85 smplrat
1 1 1 4 4123 2867 26881 -24014 .25
2 1 4 4 760 2867 4896 -2029 .25
3 1 5 4 721 2867 3730 -863 .25
4 1 15 4 142 2867 556 2311 .25
5 1 18 4 187 2867 1463 1404 .25
6 1 26 4 331 2867 1946 921 .25
7 1 30 4 127 2867 834 2033 .25
8 1 31 4 219 2867 932 1935 .25" ,
header = TRUE
)
Add a column to the province
data.frame object. The column (variable) fpc
contains only the number 32 for every observation.
province$fpc <- 32
View the entire province
data.frame object:
province
## id str clu wt ue91 meanz hou85 diffhou85 smplrat fpc ## 1 1 1 1 4 4123 2867 26881 -24014 0.25 32 ## 2 2 1 4 4 760 2867 4896 -2029 0.25 32 ## 3 3 1 5 4 721 2867 3730 -863 0.25 32 ## 4 4 1 15 4 142 2867 556 2311 0.25 32 ## 5 5 1 18 4 187 2867 1463 1404 0.25 32 ## 6 6 1 26 4 331 2867 1946 921 0.25 32 ## 7 7 1 30 4 127 2867 834 2033 0.25 32 ## 8 8 1 31 4 219 2867 932 1935 0.25 32
Construct a survey.design
object called province.design
specifying a simple random sampling design. This `province.design`
object will be used for all subsequent analysis commands:
province.design <-
svydesign(
id = ~clu ,
strata = ~str ,
data = province ,
weights = ~wt ,
fpc = ~fpc
)
From this point forward, the sampling specifications of the province
data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object province.design
at the design=
parameter of the specific R function or method.
Print the survey-weighted glm of ue91
and hou85
into a new object mysvyglm
and at the same time print it to the screen by encapsulaing the entire statement in parentheses.
(mysvyglm <- svyglm(ue91 ~ hou85, province.design))
## Stratified Independent Sampling design ## svydesign(id = ~clu, strata = ~str, data = province, weights = ~wt, ## fpc = ~fpc) ## ## Call: svyglm(formula = ue91 ~ hou85, province.design) ## ## Coefficients: ## (Intercept) hou85 ## 42.655 0.152 ## ## Degrees of Freedom: 7 Total (i.e. Null); 6 Residual ## Null Deviance: 12900000 ## Residual Deviance: 22500 AIC: 92.2
We view additional information about the model by running mysvyglm
inside the summary
function.
summary(mysvyglm)
## ## Call: ## svyglm(formula = ue91 ~ hou85, province.design) ## ## Survey design: ## svydesign(id = ~clu, strata = ~str, data = province, weights = ~wt, ## fpc = ~fpc) ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.27e+01 2.05e+01 2.08 0.083 . ## hou85 1.52e-01 7.17e-04 212.01 7.4e-13 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for gaussian family taken to be 3218) ## ## Number of Fisher Scoring iterations: 2
We can specify the df.resid=
parameter to use the survey design’s degrees of freedom (instead of the default df.resid=Inf
) to replicate Stata’s P>|t|
calculation method.
# matches stata summary(mysvyglm, df.resid = degf(province.design))
## ## Call: ## svyglm(formula = ue91 ~ hou85, province.design) ## ## Survey design: ## svydesign(id = ~clu, strata = ~str, data = province, weights = ~wt, ## fpc = ~fpc) ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.27e+01 2.05e+01 2.08 0.076 . ## hou85 1.52e-01 7.17e-04 212.01 1.4e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for gaussian family taken to be 3218) ## ## Number of Fisher Scoring iterations: 2
Store the product of the sum of the survey weights (accessed from the original province
data.frame) and the meanz
column into a separate data.frame
object.
hou85.df <- data.frame(hou85 = sum(province$wt) * 2867)
These numbers are used in the prediction that matches the result shown in the last line of Table 3.14 on page 107.
predict(mysvyglm, newdata = hou85.df, total = sum(province$wt))
## link SE ## 1 15311 600
For more detail regarding the usage of the predict
function on survey-weighted data, type ?predict.svyglm
into the console.
See also
- The R survey package homepage
- Lumley, T. Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
- Damico, A. Step-by-step instructions to analyze major public-use survey data sets with the R language