As a statistical programming language, R allows users to access precise statistics instead of simply printing a mass of output to the screen. The examples below highlight how to create a complex sample survey design object and then directly query specific coefficients, error terms, and other survey design-related information as needed.

This page uses the following packages. Make sure that you can load them before trying to run the examples on this page. If you do not have a package installed, run `install.packages("packagename")`; if the installed version is out of date, run `update.packages()`.

```
library(foreign)
library(survey)
```

**Version info:** Code for this page was tested in R version 3.0.1 (2013-05-16)
On: 2013-06-25
With: survey 3.29-5; foreign 0.8-54; knitr 1.2

## Example 1

This example is taken from Levy and Lemeshow’s Sampling of Populations, page 53.

Import the Stata dataset directly into R using the `read.dta` function from the `foreign` package:

```
mydata <-
read.dta(
"https://stats.idre.ucla.edu/stat/books/sop/momsag.dta" ,
convert.factors = FALSE
)
```

More detail about `read.dta` or any other R function can be viewed by typing a question mark in front of the function name in the R console. For example: `?read.dta`.

In the R language, individual data sets are stored as `data.frame` objects, allowing users to load as many tables into working memory as necessary for the analysis. After loading the `mydata` table into memory, R functions can be run directly on this data table.

```
# the `class` of the `mydata` object
class(mydata)
```

```
## [1] "data.frame"
```

```
# the first six rows of the data
head(mydata)
```

```
##   hospno birth momsag weight1 momsag2
## 1     13   773      0    30.9       1
## 2     13   773      1    30.9       0
## 3     13   773      1    30.9       0
## 4     13   773      1    30.9       0
## 5     13   773      1    30.9       0
## 6     13   773      1    30.9       0
```

```
# the last six rows of the data
tail(mydata)
```

```
##    hospno birth momsag weight1 momsag2
## 20     13   773      1    30.9       0
## 21     13   773      1    30.9       0
## 22     13   773      1    30.9       0
## 23     13   773      1    30.9       0
## 24     13   773      1    30.9       0
## 25     13   773      1    30.9       0
```

```
# the number of columns
ncol(mydata)
```

```
## [1] 5
```

View a simple tabulation of your variable of interest. This `table` function accesses the `momsag` column (variable) stored inside the `mydata` data.frame object.

```
table(mydata$momsag)
```

```
##
##  0  1
##  2 23
```

Initiate your `svydesign` object for a **simple random sampling** design. This `mydesign` object will be used for all subsequent analysis commands:

```
mydesign <-
svydesign(
ids = ~1 ,
data = mydata ,
weights = ~weight1 ,
fpc = ~birth
)
```

From this point forward, the sampling specifications of the `mydata` data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object `mydesign` at the `design=` parameter of the specific R function or method.

View the survey design structure:

```
mydesign
```

```
## Independent Sampling design
## svydesign(ids = ~1, data = mydata, weights = ~weight1, fpc = ~birth)
```

View the survey design’s object class or type:

```
class(mydesign)
```

```
## [1] "survey.design2" "survey.design"
```

View the weighted total population of this survey design, by referring to the `weight1` column from the original data.frame:

```
sum(mydata$weight1)
```

```
## [1] 773
```

View the degrees of freedom for this survey design object:

```
degf(mydesign)
```

```
## [1] 24
```

Count the number of unweighted observations where the variable `momsag` is not missing:

```
unwtd.count(~momsag, mydesign)
```

```
##        counts SE
## counts     25  0
```

Print the mean and standard error of the `momsag` variable:

```
svymean(~momsag, mydesign)
```

```
##        mean   SE
## momsag 0.92 0.05
```

Print the weighted total and standard error of the `momsag` variable:

```
svytotal(~momsag, mydesign)
```

```
##        total   SE
## momsag   711 42.1
```

Alternatively, the result of a function call like `svymean` or `svytotal` can be stored into a secondary R object.

```
mysvymean <- svymean(~momsag, mydesign, deff = TRUE)
```

Once created, this `svymean` result can be queried independently from the `mydesign` object. For example, the `coef` and `SE` functions can directly extract those attributes:

```
coef(mysvymean)
```

```
## momsag
##   0.92
```

```
SE(mysvymean)
```

```
##        momsag
## momsag 0.0545
```

The design effect extraction function `deff` can only be used if the original `svymean` call that created the object `mysvymean` included the parameter `deff = TRUE`.

```
deff(mysvymean)
```

```
## momsag
##      1
```

The design effect compares the variance of an estimate under the actual survey design with the variance that a simple random sample of the same size would produce. A **simple random sample** design should therefore have a design effect of 1.
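Because the design is a simple random sample drawn without replacement, the standard error above can also be checked by hand with the textbook formula that includes the finite population correction. The following is only a minimal base R sketch: the vector `y` is rebuilt from the `table(mydata$momsag)` counts shown earlier rather than read from the data file, and the names `y`, `n`, `N`, and `se_by_hand` are illustrative assumptions, not objects created by the survey package.

```r
# Rebuild the sample from the tabulation above: 2 zeroes and 23 ones
# (an assumption standing in for mydata$momsag)
y <- c(rep(0, 2), rep(1, 23))

n <- length(y)   # 25 sampled records
N <- 773         # population size, the constant stored in the `birth` column

# Standard error of a mean under simple random sampling without replacement:
# sqrt( (1 - n/N) * s^2 / n ), where s^2 is the sample variance
se_by_hand <- sqrt((1 - n / N) * var(y) / n)

mean(y)      # point estimate, comparable to coef(mysvymean)
se_by_hand   # comparable to SE(mysvymean)
```

If the hand calculation and `SE(mysvymean)` agree, the ratio of the two variances is 1, which is what `deff(mysvymean)` reports.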

We can use the `confint` function to obtain confidence intervals for the coefficient estimates.

```
confint(mysvymean)
```

```
##        2.5 % 97.5 %
## momsag 0.813   1.03
```
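The interval printed by `confint` appears to be a symmetric normal-theory (Wald) interval: the estimate plus or minus a normal quantile times the standard error. A minimal sketch in base R, using the estimate and standard error copied from the printed output above (the names `est`, `se`, and `wald_ci` are illustrative assumptions):

```r
est <- 0.92     # copied from coef(mysvymean) above
se  <- 0.0545   # copied from SE(mysvymean) above (rounded as printed)

# Symmetric 95% interval based on the standard normal distribution
wald_ci <- est + c(-1, 1) * qnorm(0.975) * se

round(wald_ci, 3)
```

Nothing in this calculation knows that the quantity is a proportion, so the interval is not constrained to lie between zero and one.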

Note from our unweighted table that the variable `momsag` is binary (composed strictly of zeroes and ones). However, the `confint` function call above produced an interval with an upper bound greater than one. To produce confidence intervals better suited to proportions, users might start with the options discussed in `?svyciprop`. For example:

```
svyciprop(~momsag, mydesign, method = "logit")
```

```
##               2.5% 97.5%
## momsag 0.920 0.737  0.98
```

Also note that the number of decimal places shown can be adjusted by modifying the `digits` parameter within the `options` function at any time.

```
options(digits = 10)
SE(mysvymean)
```

```
##               momsag
## momsag 0.05447463617
```

## Example 2

This example is taken from Lehtonen and Pahkinen’s Practical Methods for Design and Analysis of Complex Surveys, page 29, Table 2.4: estimates from a simple random sample drawn without replacement (n = 8); the Province’91 population.

Import the dataset from text directly into R using the `read.table` function, with the `text=` parameter holding the entire data set. Each line break inside the quoted string (the newline character, `\n`) marks the end of one line of data.

```
province <-
read.table( text =
"id cluster ue91 lab91
1 1 4123 33786
2 4 760 5919
3 5 721 4930
4 15 142 675
5 18 187 1448
6 26 331 2543
7 30 127 1084
8 31 219 1330" ,
header = TRUE
)
```

Add two columns to the `province` data.frame object. The columns (variables) `fpc` and `weights` contain only the numbers 32 and 4, respectively.

```
province$fpc <- 32
province$weights <- 4
```

View the entire `province` data.frame object:

```
province
```

```
##   id cluster ue91 lab91 fpc weights
## 1  1       1 4123 33786  32       4
## 2  2       4  760  5919  32       4
## 3  3       5  721  4930  32       4
## 4  4      15  142   675  32       4
## 5  5      18  187  1448  32       4
## 6  6      26  331  2543  32       4
## 7  7      30  127  1084  32       4
## 8  8      31  219  1330  32       4
```

Construct a `survey.design` object called `province.design` specifying a **simple random sampling** design. This `province.design` object will be used for all subsequent analysis commands:

```
province.design <-
svydesign(
ids = ~1 ,
data = province ,
fpc = ~fpc ,
weights = ~weights
)
```

From this point forward, the sampling specifications of the `province` data set’s survey design have been fixed and most analysis commands will simply use the set of tools outlined on the R survey package homepage, referring to the object `province.design` at the `design=` parameter of the specific R function or method.

View the survey design structure:

```
province.design
```

```
## Independent Sampling design
## svydesign(ids = ~1, data = province, fpc = ~fpc, weights = ~weights)
```

View the weighted total population of this survey design, by referring to the `weights` column from the original data.frame:

```
sum(province$weights)
```

```
## [1] 32
```

Count the number of unweighted observations where the variable `ue91` is not missing:

```
unwtd.count(~ue91, province.design)
```

```
##        counts SE
## counts      8  0
```

Print the mean and standard error of the `ue91` variable:

```
svymean(~ue91, province.design)
```

```
##        mean        SE
## ue91 826.25 415.07059
```

Print the weighted total and standard error of the `ue91` variable:

```
svytotal(~ue91, province.design)
```

```
##      total        SE
## ue91 26440 13282.259
```
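For a simple random sample drawn without replacement, these estimates can be cross-checked with textbook formulas in base R. The sketch below is only an illustration under that assumption: it retypes the eight `ue91` values from the table above, and the names `y`, `n`, `N`, `w`, `se_mean`, and `se_total` are hypothetical helpers rather than objects produced by the survey package.

```r
# The eight sampled ue91 values, retyped from the province table above
y <- c(4123, 760, 721, 142, 187, 331, 127, 219)

n <- length(y)   # sample size: 8
N <- 32          # population size, the value stored in the `fpc` column
w <- N / n       # sampling weight: 4, the value stored in the `weights` column

# Mean and its standard error, with the finite population correction (1 - n/N)
mean_hat <- mean(y)
se_mean  <- sqrt((1 - n / N) * var(y) / n)

# The estimated total scales both quantities by the population size
total_hat <- N * mean_hat
se_total  <- N * se_mean

c(mean = mean_hat, SE = se_mean)
c(total = total_hat, SE = se_total)
```

The weight of 4 is simply the population size divided by the sample size, which is why summing the `weights` column returns 32.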

Save the ratio of `ue91` to `lab91` into a new object `myratio` and at the same time print it to the screen by encapsulating the entire statement in parentheses.

```
(myratio <- svyratio(~ue91, ~lab91, province.design))
```

```
## Ratio estimator: svyratio.survey.design2(~ue91, ~lab91, province.design)
## Ratios=
##             lab91
## ue91 0.1278159141
## SEs=
##               lab91
## ue91 0.004087264606
```
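The ratio estimate is the weighted total of `ue91` divided by the weighted total of `lab91`; with equal weights, the weights cancel. Its standard error can be approximated by Taylor linearization, applying the simple random sampling formula to the residuals `ue91 - ratio * lab91`. The following base R sketch assumes that approach and retypes the data from the table above; the names `ratio_hat`, `resid_lin`, and `se_ratio` are illustrative, not part of the survey package.

```r
# Sampled values retyped from the province table above
ue91  <- c(4123, 760, 721, 142, 187, 331, 127, 219)
lab91 <- c(33786, 5919, 4930, 675, 1448, 2543, 1084, 1330)

n <- length(ue91)   # sample size: 8
N <- 32             # population size from the `fpc` column

# Point estimate: ratio of the two totals (equal weights cancel)
ratio_hat <- sum(ue91) / sum(lab91)

# Linearization residuals, then the SRS variance formula with the
# finite population correction, scaled by the mean of the denominator
resid_lin <- ue91 - ratio_hat * lab91
se_ratio  <- sqrt((1 - n / N) * var(resid_lin) / n) / mean(lab91)

c(ratio = ratio_hat, SE = se_ratio)
```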

We can then use the `confint` function to obtain confidence intervals for the ratio.

```
confint(myratio)
```

```
##                   2.5 %       97.5 %
## ue91/lab91 0.1198050227 0.1358268056
```

We can specify the `df=` parameter to use the survey design’s degrees of freedom (instead of the default `df=Inf`) to replicate Stata’s confidence interval calculation method.

```
# matches Stata
confint(myratio, df = degf(province.design))
```

```
##                   2.5 %       97.5 %
## ue91/lab91 0.1181510691 0.1374807592
```
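The difference between the two intervals comes down to which quantile multiplies the standard error: a standard normal quantile when `df=Inf`, or a Student t quantile when finite degrees of freedom are supplied. A minimal base R sketch, using the ratio and standard error copied from the printed output above and assuming 7 degrees of freedom (8 sampled units minus 1); the names `ci_normal` and `ci_t` are illustrative:

```r
ratio_hat <- 0.1278159141     # copied from the svyratio output above
se_ratio  <- 0.004087264606   # copied from the svyratio output above
df_design <- 8 - 1            # assumed design degrees of freedom for this sample

# Default behaviour: normal quantile (equivalent to df = Inf)
ci_normal <- ratio_hat + c(-1, 1) * qnorm(0.975) * se_ratio

# With finite degrees of freedom: wider interval from the t distribution
ci_t <- ratio_hat + c(-1, 1) * qt(0.975, df = df_design) * se_ratio

rbind(normal = ci_normal, t = ci_t)
```

With only eight observations the t quantile is noticeably larger than the normal one, so the second interval is wider.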

Print the median of the `ue91` variable, including the confidence interval in the output.

```
svyquantile(~ue91, province.design, quantiles = 0.5, ci = TRUE)
```

```
## $quantiles
##      0.5
## ue91 219
##
## $CIs
## , , ue91
##
##                0.5
## (lower 133.5070715
## upper) 743.0816141
```

## See also

- The R survey package homepage
- Lumley, T. Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
- Damico, A. Step-by-step instructions to analyze major public-use survey data sets with the R language