How can I do regression estimation with survey data?

As a statistical programming language, R allows users to access precise statistics instead of simply printing a mass of output to the screen. The examples below highlight how to create a complex sample survey design object and then directly query specific coefficients, error terms, and other survey design-related information as needed.

This page uses the survey package. Make sure that you can load it before trying to run the examples on this page. If you do not have a package installed, run: install.packages("packagename"), or if you see the version is out of date, run: update.packages().

library(survey)

Version info: Code for this page was tested in R version 3.0.1 (2013-05-16) On: 2013-06-25 With: survey 3.29-5; knitr 1.2

The following example relies on the svyglm function from the R survey package. More detailed instructions and additional usage examples can be found on the survey package’s survey-weighted generalized linear models page.

Example

This example is taken from Lehtonen and Pahkinen’s Practical Methods for Design and Analysis of Complex Surveys.

Page 107 Table 3.14

Model-assisted estimation results for the population total of ue91 sample of eight elements drawn from the Province’91 population.

Import the dataset from text directly into R using the read.table function and the text= parameter specifying the entire data set. The syntax n indicates the end of one line of data.

province <-
  read.table( text =
		"id str clu wt ue91 meanz hou85 diffhou85 smplrat
		1 1 1 4 4123 2867 26881 -24014 .25
		2 1 4 4 760 2867 4896 -2029 .25
		3 1 5 4 721 2867 3730 -863 .25
		4 1 15 4 142 2867 556 2311 .25
		5 1 18 4 187 2867 1463 1404 .25
		6 1 26 4 331 2867 1946 921 .25
		7 1 30 4 127 2867 834 2033 .25
		8 1 31 4 219 2867 932 1935 .25" ,
		header = TRUE
	)

Add a column to the province data.frame object. The column (variable) fpc contains only the number 32 for every observation.

province$fpc <- 32

View the entire province data.frame object:

province

##   id str clu wt ue91 meanz hou85 diffhou85 smplrat fpc
## 1  1   1   1  4 4123  2867 26881    -24014    0.25  32
## 2  2   1   4  4  760  2867  4896     -2029    0.25  32
## 3  3   1   5  4  721  2867  3730      -863    0.25  32
## 4  4   1  15  4  142  2867   556      2311    0.25  32
## 5  5   1  18  4  187  2867  1463      1404    0.25  32
## 6  6   1  26  4  331  2867  1946       921    0.25  32
## 7  7   1  30  4  127  2867   834      2033    0.25  32
## 8  8   1  31  4  219  2867   932      1935    0.25  32

Construct a survey.design object called province.design specifying a simple random sampling design. This `province.design` object will be used for all subsequent analysis commands:

province.design <-
  svydesign(
		id = ~clu ,
    strata = ~str ,
		data = province ,
		weights = ~wt ,
		fpc = ~fpc
	)

From this point forward, the sampling specifications of the province data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object province.design at the design= parameter of the specific R function or method.

Print the survey-weighted glm of ue91 and hou85 into a new object mysvyglm and at the same time print it to the screen by encapsulaing the entire statement in parentheses.

(mysvyglm <- svyglm(ue91 ~ hou85, province.design))

## Stratified Independent Sampling design
## svydesign(id = ~clu, strata = ~str, data = province, weights = ~wt, 
##     fpc = ~fpc)
## 
## Call:  svyglm(formula = ue91 ~ hou85, province.design)
## 
## Coefficients:
## (Intercept)        hou85  
##      42.655        0.152  
## 
## Degrees of Freedom: 7 Total (i.e. Null);  6 Residual
## Null Deviance:	    12900000 
## Residual Deviance: 22500 	AIC: 92.2

We view additional information about the model by running mysvyglm inside the summary function.

summary(mysvyglm)

## 
## Call:
## svyglm(formula = ue91 ~ hou85, province.design)
## 
## Survey design:
## svydesign(id = ~clu, strata = ~str, data = province, weights = ~wt, 
##     fpc = ~fpc)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.27e+01   2.05e+01    2.08    0.083 .  
## hou85       1.52e-01   7.17e-04  212.01  7.4e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3218)
## 
## Number of Fisher Scoring iterations: 2

We can specify the df.resid= parameter to use the survey design’s degrees of freedom (instead of the default df.resid=Inf) to replicate Stata’s P>|t| calculation method.

# matches stata
summary(mysvyglm, df.resid = degf(province.design))

## 
## Call:
## svyglm(formula = ue91 ~ hou85, province.design)
## 
## Survey design:
## svydesign(id = ~clu, strata = ~str, data = province, weights = ~wt, 
##     fpc = ~fpc)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.27e+01   2.05e+01    2.08    0.076 .  
## hou85       1.52e-01   7.17e-04  212.01  1.4e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3218)
## 
## Number of Fisher Scoring iterations: 2

Store the product of the sum of the survey weights (accessed from the original province data.frame) and the meanz column into a separate data.frame object.

hou85.df <- data.frame(hou85 = sum(province$wt) * 2867)

These numbers are used in the prediction that matches the result shown in the last line of Table 3.14 on page 107.

predict(mysvyglm, newdata = hou85.df, total = sum(province$wt))

##    link  SE
## 1 15311 600

For more detail regarding the usage of the predict function on survey-weighted data, type ?predict.svyglm into the console.

How can I do regression estimation with survey data? | R FAQ

Example

See also