Applied Survey Data Analysis with R

Christine R. Wells, Ph.D.

Statistical Methods and Data Analytics

UCLA Office of Advanced Research Computing

Introduction

About this workshop

  • Assumes basic familiarity with R
  • Assumes basic knowledge of data analysis
  • Introduction to complex survey data and how they are analyzed
  • R code and output are provided
  • Ask questions

Different kinds of data collection methods

  • Experiment: assume simple random sample (SRS) from a known population and equal weighting
  • Quasi-experiment: assume SRS
  • Online questionnaire (e.g., SurveyMonkey, REDCap): convenience sample; weighting unknown (and possibly unknowable)
  • Complex survey data: not SRS, but possible to calculate sampling weight

Question for audience

  • Why not use simple random sampling?

How the data are collected

  • Sampling frame: listing of all elements of a population
  • First step: stratify or cluster?
    • Stratification: each element belongs to exactly one stratum
  • Can use different sampling methods in each stratum

Drilling down

  • Primary sampling unit (PSU)
  • Secondary sampling unit (SSU)
  • Tertiary sampling unit (TSU)
  • Why are the SSU and TSU not included in the syntax/math?

More about selecting respondents

  • Over- and under-sampling
  • Generalization (know the population, e.g., NHANES)
  • Unit non-response (a big problem)
  • Sampling with and without replacement
  • Difference between complex survey data and data from a convenience sample: the sampling probability is known

Sampling weight

  • Difference between sampling weight and probability weight
  • Definition of probability weight: N/n
    • N = population size
    • n = sample size
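  • Example (hypothetical numbers): N = 10,000 and n = 500 gives a weight of 10,000 / 500 = 20, so each sampled element represents 20 elements of the population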

Sampling weight continued

  • Multiply the probability weights at each stage
  • Corrections: unit non-response, trimming, raking, etc.
  • Sum to population size
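  • Example (hypothetical numbers): a PSU selected with probability 1/50 and a person selected within that PSU with probability 1/20 gives an overall base weight of 50 × 20 = 1,000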

Stratification

  • Purpose of stratification: theoretical and practical
  • Geographical and demographic
  • Must know some information about every element
  • Most effective when the variance of the dependent variable is smaller within the strata than in the sample as a whole
  • Textbook examples (stratification variable related to outcome variable) versus real data

Clustering

  • Clustering: a necessary evil
  • How many PSUs per stratum?
  • Singleton PSUs, consequences, and how to handle them

Denominator degrees of freedom

  • Calculated as (number of PSUs) - (number of strata)
  • For example, 30 PSUs - 15 strata = 15 (the NHANES design used in the examples below has 30 PSUs in 15 strata)

Effects of survey elements

  • Changes the math used to calculate a value/quantity
  • Sampling weight: point estimate (e.g., mean, coefficient)
  • PSU and strata: variance estimates
    (e.g., standard error and CI)
  • PSU and strata: theory versus reality

Effects of not using survey elements

  • No sampling weight: point estimates do not represent the population
  • No strata and/or PSU: standard errors are artificially small
    • Leads to Type I errors
  • Cannot generalize beyond the sample

Sampling with and without replacement

  • Textbooks and math distinguish between with and without replacement
  • Real world: never sample with replacement
  • Math used in software: almost always assumes with replacement
    • Why?

Questions for audience

  1. What is the sampling weight used for?
  2. What is the consequence of ignoring the sampling weight?
  3. The PSU information is used in the calculation of what?

NHANES data

Getting data

  • https://wwwn.cdc.gov/nchs/nhanes/default.aspx
  • Read the documentation!
  • Get information on sampling, variables, etc.
  • Don’t need to read everything - be (a little) selective

What to read

  • Introduction
  • Sample design and analysis guidelines
  • Variance estimation
  • Missing data/imputation
  • Variable-specific documentation
  • Data user agreement

Using NHANES

  • Different data collection modes
    • Questionnaire
    • Exam
    • Lab
    • Limited access (not discussed here)
  • Each has a different sampling weight
  • Use the sampling weight associated with the data collection mode that has the smallest N
    • Called the “least common denominator” (e.g., when combining interview and exam variables, use the exam (MEC) weight, because the examined subsample is smaller)

Merge versus append

  • Merge when combining different datasets from the same cycle (use seqn; see the sketch below)
  • Append when combining datasets from different cycles
    • Correct sampling weight
    • Include variable coding for year or cycle
    • Pay attention to documentation regarding pandemic
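
A minimal sketch of both operations (the data frame names demo, ohq, demo_a, and demo_b are hypothetical placeholders; the exact weight adjustment depends on which cycles are combined, so check the documentation):

# merge two files from the same cycle on the respondent ID
onecycle <- merge(demo, ohq, by = "seqn")

# append the same file from two cycles; add a cycle indicator first
demo_a$cycle <- 1
demo_b$cycle <- 2
twocycles <- rbind(demo_a, demo_b)

# when combining cycles, build the combined weight per the documentation
# (e.g., dividing each 2-year weight by the number of cycles combined)
twocycles$wtcombined <- twocycles$wtint2yr / 2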

Where to get help?

  • https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/faq.aspx
  • https://wwwn.cdc.gov/nchs/nhanes/tutorials/default.aspx

R code

Libraries

library("haven")
library("survey")
library("jtools")
library("remotes")

Reading in the data

  • Read in your data file
  • NHANES data are distributed from the NHANES website as SAS transport (.xpt) files, which can be read with haven::read_xpt() (see the sketch below)
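
For example, using the haven package loaded above (the file name DEMO_L.xpt and the object name demo are placeholders for whichever file you downloaded):

# read a SAS transport (.xpt) file downloaded from the NHANES website
demo <- read_xpt("DEMO_L.xpt")

# NHANES variable names arrive in upper case; lower-casing them matches the code below
names(demo) <- tolower(names(demo))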

Specifying the survey elements

nhc <- svydesign(id=~sdmvpsu, weights=~wtint2yr, strata=~sdmvstra, nest=TRUE, 
                 survey.lonely.psu = "adjust", data=nhanes2021)

Specifying the survey elements continued

summary(nhc)
Stratified 1 - level Cluster Sampling design (with replacement)
With (30) clusters.
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)
Probabilities:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
5.849e-06 2.956e-05 4.615e-05 5.219e-05 6.978e-05 2.181e-04 
Stratum Sizes: 
           173 174 175 176 177 178 179 180 181 182 183 184 185 186 187
obs        837 832 790 744 832 797 821 797 799 799 696 957 774 676 782
design.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
actual.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
Data variables:
 [1] "dmdborn4" "dmdeduc2" "dmdhhsiz" "dmdhragz" "dmdhredz" "dmdhrgnd"
 [7] "dmdhrmaz" "dmdhsedz" "dmdmartz" "dmdyrusr" "dmqmiliz" "female"  
[13] "indfmpir" "ohq620"   "ohq630"   "ohq640"   "ohq660"   "ohq670"  
[19] "ohq680"   "ohq845"   "pad680"   "pad790q"  "pad790u"  "pad800"  
[25] "pad810q"  "pad810u"  "pad820"   "riagendr" "ridagemn" "ridageyr"
[31] "ridexagm" "ridexmon" "ridexprg" "ridreth1" "ridreth3" "ridstatr"
[37] "rxq510"   "rxq515"   "rxq520"   "sddsrvyr" "sdmvpsu"  "sdmvstra"
[43] "seqn"     "wtint2yr" "wtmec2yr"

Working with data before analysis

  • Check the mean, minimum, and maximum of the sampling weight variable
  • Check for missing values on the weight, PSU, and strata variables
  • Do data management tasks (e.g., create new variables, recode variables); a sketch follows below
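
A minimal sketch of these checks, using the variable names from the nhanes2021 data used throughout:

# distribution of the sampling weight (should be positive, with no missing values)
summary(nhanes2021$wtint2yr)

# count missing values on the weight, PSU, and strata variables
colSums(is.na(nhanes2021[, c("wtint2yr", "sdmvpsu", "sdmvstra")]))

# example data management: create an indicator before (re)building the design object
nhanes2021$age65 <- as.numeric(nhanes2021$ridageyr >= 65)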

The variables used in examples

  • ridageyr: Age in years; top coded at 80
  • female: 0 = male; 1 = female
  • dmdborn4: 0 = born elsewhere; 1 = born in US
  • dmdeduc2: 1 = less than 9th grade; 2 = no HS diploma; 3 = HS grad or GED; 4 = some college or AA degree; 5 = college grad or above
  • dmqmiliz: 0 = did not serve active duty in US Armed Forces; 1 = did serve active duty in US Armed Forces

The variables used in examples continued

  • dmdhrmaz: 1 = married; 2 = widowed/divorced; 3 = never married
  • ridexprg: 0 = not pregnant at exam; 1 = pregnant at exam
  • ohq620: How often last year had aching in mouth? 1 = very often; 2 = fairly often; 3 = occasionally; 4 = hardly ever; 5 = never
  • ohq630: How often felt bad because of mouth? 1 = very often; 2 = fairly often; 3 = occasionally; 4 = hardly ever; 5 = never
  • ohq845: Rate the health of your teeth and gums: 1 = excellent; 2 = very good; 3 = good; 4 = fair; 5 = poor

The variables used in examples continued

  • pad680: Minutes sedentary activity: 0 - 1380
  • pad800: Minutes moderate leisure time physical activity (LTPA): 1 - 720
  • pad820: Minutes vigorous leisure time physical activity (LTPA): 1 - 900
  • rxq510: Dr. told you to take daily low-dose aspirin? 0 = no; 1 = yes

Weighting in examples

  • Only demographic and interview (questionnaire) data are used in these examples
  • No need to change the weight variable
  • Not necessarily true for other analyses

Descriptive statistics

Descriptive statistics with continuous variables

  • Calculating means (documentation for svymean is found under surveysummary)
svymean(~ridageyr, design = nhc)
           mean     SE
ridageyr 38.989 0.5149
  • Calculating standard deviations
svysd(~ridageyr,design = nhc, na = TRUE)
         std. dev.
ridageyr    22.841

Means with missing data

sum(is.na(nhanes2021$ridageyr))
[1] 0
svymean(~ridageyr, nhc, na = TRUE)
           mean     SE
ridageyr 38.989 0.5149
sum(is.na(nhanes2021$ohq620))
[1] 204
svymean(~ohq620, nhc, na = TRUE)
         mean   SE
ohq620 4.1774 0.02
sum(is.na(nhanes2021$pad680))
[1] 3868
svymean(~pad680, nhc, na = TRUE)
         mean     SE
pad680 363.49 5.4261

Listwise deletion of missing cases across all variables

  • Notice that the means shown below do not match those seen before for ridageyr and ohq620
  • Before: ridageyr = 38.99; ohq620 = 4.18; pad680 = 363.49
svymean(~ridageyr+ohq620+pad680, nhc, na = TRUE)
            mean     SE
ridageyr  47.711 0.4690
ohq620     4.126 0.0238
pad680   363.489 5.4395

A note about missing data

  • Do not delete cases with missing data!
  • Redefines the population in an unknowable way

Getting the mean, standard deviation and variance

svymean(~pad680, nhc, na = TRUE)
         mean     SE
pad680 363.49 5.4261
svysd(~pad680, design = nhc, na = TRUE)
       std. dev.
pad680    209.83
svyvar(~pad680, design = nhc, na = TRUE)
       variance     SE
pad680    44029 1321.9

Definition of Coefficient of Variation (CV)

  • Coefficient of variation: the ratio of the standard error of an estimate to the estimate itself, often expressed as a percentage
  • A measure of relative precision that does not depend on the unit of measurement of the variable
  • Note that the survey package’s cv() function returns the raw ratio (not multiplied by 100)

Definition of Design Effect (DEFF)

  • Ratio of two variances
  • Numerator: variance estimate from the current sample (including all of its design elements)
  • Denominator: variance from a hypothetical sample of the same size drawn as an SRS

Definition of Design Effect (DEFF) continued

  • DEFF indicates how efficient your sample is compared to an SRS of equal size
  • DEFF less than 1: sample is more efficient than SRS
  • DEFF is almost always greater than 1 (due to clustering)

Getting the CV and DEFF

cv(svymean(~pad680, design = nhc, na = TRUE))
           pad680
pad680 0.01492782
svymean(~pad680, nhc, na = TRUE, deff = TRUE)
           mean       SE   DEff
pad680 363.4908   5.4261 5.3934
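
  • As a check, 5.4261 / 363.4908 ≈ 0.0149, which matches the cv() output above (about 1.5% expressed as a percentage)
  • A DEFF of 5.39 means the design-based variance of this mean is roughly 5.4 times what an SRS of the same size would give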

Getting quantiles

svyquantile(~ridageyr, design = nhc, na = TRUE, c(.25,.5,.75),ci=TRUE)
$ridageyr
     quantile ci.2.5 ci.97.5        se
0.25       19     18      21 0.7037464
0.5        38     37      40 0.7037464
0.75       58     57      60 0.7037464

attr(,"hasci")
[1] TRUE
attr(,"class")
[1] "newsvyquantile"

Getting totals

svytotal(~dmdborn4,design = nhc, na = TRUE)
             total       SE
dmdborn4 273897708 13781540

Getting the CV and DEFF for totals

cv(svytotal(~dmdborn4,design = nhc, na = TRUE))
           dmdborn4
dmdborn4 0.05031637
svytotal(~dmdborn4,design = nhc, na = TRUE, deff = TRUE)
             total        SE   DEff
dmdborn4 273897708  13781540 156.83

Means and proportions for binary variables

svymean(~female, nhc)
          mean     SE
female 0.50851 0.0068

Getting confidence intervals

  • Per the documentation, use svyciprop() rather than confint()
  • There are many options for method
  • method = "likelihood" uses the (Rao-Scott) scaled chi-squared distribution for the loglikelihood from a binomial distribution
svyciprop(~I(female==1), nhc, method="likelihood")
                      2.5% 97.5%
I(female == 1) 0.509 0.494 0.523

Getting confidence intervals continued

  • li is short for likelihood
svyciprop(~I(female==0), nhc, method="li")
                      2.5% 97.5%
I(female == 0) 0.491 0.477 0.506

More on getting confidence intervals continued

  • Fits a logistic regression model and computes a Wald-type interval on the log-odds scale, which is then transformed to the probability scale
svyciprop(~I(female==1), nhc, method="logit")
                      2.5% 97.5%
I(female == 1) 0.509 0.494 0.523

More on getting confidence intervals continued

  • Uses a logit transformation of the mean and then back-transforms to the probability scale
  • Method used by SUDAAN and SPSS COMPLEX SAMPLES
svyciprop(~I(female==1), nhc, method="xlogit")
                      2.5% 97.5%
I(female == 1) 0.509 0.494 0.523

Descriptive statistics for categorical variables

  • Frequencies
svytable(~female, design = nhc)
female
        0         1 
160722480 166291071 

Crosstabulations: 2-way tables

  • Method 1
svytable(~female+dmdborn4, nhc)
      dmdborn4
female         0         1
     0  25904267 134595325
     1  26772926 139302383

Crosstabulations: 2-way tables

  • Method 2
svytable(~interaction(female, dmdborn4), design = nhc)
interaction(female, dmdborn4)
      0.0       1.0       0.1       1.1 
 25904267  26772926 134595325 139302383 

Crosstabulations: 3-way table

svytable(~interaction(female, dmdborn4, rxq510), design = nhc)
interaction(female, dmdborn4, rxq510)
   0.0.0    1.0.0    0.1.0    1.1.0    0.0.1    1.0.1    0.1.1    1.1.1 
11733993 12611137 36705490 46617656  3526721  3259081 23365237 20067392 

Crosstabulations: 4-way table

  • Getting very difficult to read
svytable(~interaction(female, dmdborn4, ridexprg, dmdeduc2), design = nhc)
interaction(female, dmdborn4, ridexprg, dmdeduc2)
    0.0.0.1     1.0.0.1     0.1.0.1     1.1.0.1     0.0.1.1     1.0.1.1 
       0.00   824620.69        0.00    43684.05        0.00        0.00 
    0.1.1.1     1.1.1.1     0.0.0.2     1.0.0.2     0.1.0.2     1.1.0.2 
       0.00        0.00        0.00   861261.38        0.00  1122922.04 
    0.0.1.2     1.0.1.2     0.1.1.2     1.1.1.2     0.0.0.3     1.0.0.3 
       0.00        0.00        0.00    84449.45        0.00  2060327.67 
    0.1.0.3     1.1.0.3     0.0.1.3     1.0.1.3     0.1.1.3     1.1.1.3 
       0.00  5732701.94        0.00   146333.46        0.00   435114.57 
    0.0.0.4     1.0.0.4     0.1.0.4     1.1.0.4     0.0.1.4     1.0.1.4 
       0.00  1380815.39        0.00 10247378.16        0.00    71985.69 
    0.1.1.4     1.1.1.4     0.0.0.5     1.0.0.5     0.1.0.5     1.1.0.5 
       0.00   236404.64        0.00  3591233.65        0.00 13480603.96 
    0.0.1.5     1.0.1.5     0.1.1.5     1.1.1.5 
       0.00    59540.46        0.00   522743.95 

Chi-square test

  • 2 by 2
svychisq(~female+dmdborn4, nhc, statistic="adjWald")

    Design-based Wald test of association

data:  svychisq(~female + dmdborn4, nhc, statistic = "adjWald")
F = 0.00052674, ndf = 1, ddf = 15, p-value = 0.982

Chi-square test continued

  • 5 by 2
svychisq(~dmdeduc2+dmqmiliz, nhc, statistic="adjWald")

    Design-based Wald test of association

data:  svychisq(~dmdeduc2 + dmqmiliz, nhc, statistic = "adjWald")
F = 11.015, ndf = 4, ddf = 12, p-value = 0.0005499

Try it yourself

  • Run the following code, and then try a few descriptive analyses (mostly categorical)
data(nhanes)
design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, 
                    nest=TRUE, data=nhanes)
design
Stratified 1 - level Cluster Sampling design (with replacement)
With (31) clusters.
svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
    nest = TRUE, data = nhanes)

Graphing

Graphing of continuous variables: Histogram

  • Default is to show density on y-axis
svyhist(~pad680, nhc)

histogram of pad680 with density on y-axis

Histogram with count on y-axis

svyhist(~pad680, nhc, probability = FALSE)

histogram of pad680 with counts on the y-axis

Boxplot

svyboxplot(~ridageyr~1, nhc, all.outliers=TRUE)

boxplot of the age variable

Boxplot broken into categories

  • The grouping variable must be a factor
svyboxplot(~ridageyr~factor(female), nhc, all.outliers=TRUE)

boxplot of age by sex

Barchart

barplt<-svyby(~ohq620+ohq630, ~female, nhc, na = TRUE, svymean)
barplot(barplt,beside=TRUE,legend=TRUE)

barplot of ohq620 and ohq630

Dotplot

dotchart(barplt)

barplot recast as a dotchart

Scatterplot with weights as bubble size

svyplot(pad680~pad820, design = nhc, style="bubble")

scatterplot of pad680 by pad820 with bubbles sized by sampling weight

Five minute break

[sleeping kitten indicating break time]

Subpopulations

Analyzing only some cases

  • Analysis of survey data and experimental data is quite different
  • Cannot delete cases from a weighted dataset
  • Cases in the subpopulation are used to calculate point estimate
  • All cases are used to calculate the standard error
  • Can create a binary variable to specify the subpopulation (see the sketch below)
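
A minimal sketch of the indicator-variable approach (the 65-and-over cutoff is just an illustration):

# add a subpopulation indicator to the design object
nhc2 <- update(nhc, senior = as.numeric(ridageyr >= 65))

# the point estimate uses only the subpopulation; the full design is retained for the SEs
svymean(~pad680, subset(nhc2, senior == 1), na = TRUE)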

Examples with minutes of sedentary activity

  • No subpopulation
svymean(~pad680, nhc, na=TRUE)
         mean     SE
pad680 363.49 5.4261
  • Mean of pad680 broken out by the variable female
svyby(~pad680, ~female, nhc, svymean, na=TRUE)
  female  pad680       se
0      0 368.010 7.794388
1      1 359.229 4.203481

Using subset.survey.design to see the results for only one group

svymean(~pad680, subset(nhc, female == 1), na=TRUE)
         mean     SE
pad680 359.23 4.2035

Mean of pad680 by two variables

  • By low-dose aspirin and female
svyby(~pad680, ~rxq510+female, nhc, svymean, na=TRUE)
    rxq510 female   pad680        se
0.0      0      0 348.3941 10.147146
1.0      1      0 384.2072  8.794646
0.1      0      1 349.9134  6.862274
1.1      1      1 369.0892  9.403969

Three variables to define the subpopulation

  • Need the “na = TRUE”
svyby(~pad680, ~dmqmiliz+rxq510+female, nhc, na = TRUE, svymean)
      dmqmiliz rxq510 female   pad680        se
0.0.0        0      0      0 345.0032 11.938892
1.0.0        1      0      0 366.2002 16.668693
0.1.0        0      1      0 377.0661  9.220377
1.1.0        1      1      0 404.9886 13.913967
0.0.1        0      0      1 348.9575  7.288462
1.0.1        1      0      1 385.2752 35.287306
0.1.1        0      1      1 369.0723  9.474665
1.1.1        1      1      1 369.9402 30.816511

Subsetting to include only males

smale <- subset(nhc,female == 0)
summary(smale)
Stratified 1 - level Cluster Sampling design (with replacement)
With (30) clusters.
subset(nhc, female == 0)
Probabilities:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
5.849e-06 2.822e-05 4.460e-05 5.073e-05 6.848e-05 2.181e-04 
Stratum Sizes: 
           173 174 175 176 177 178 179 180 181 182 183 184 185 186 187
obs        381 416 364 358 381 391 405 377 372 359 344 430 343 311 343
design.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
actual.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
Data variables:
 [1] "dmdborn4" "dmdeduc2" "dmdhhsiz" "dmdhragz" "dmdhredz" "dmdhrgnd"
 [7] "dmdhrmaz" "dmdhsedz" "dmdmartz" "dmdyrusr" "dmqmiliz" "female"  
[13] "indfmpir" "ohq620"   "ohq630"   "ohq640"   "ohq660"   "ohq670"  
[19] "ohq680"   "ohq845"   "pad680"   "pad790q"  "pad790u"  "pad800"  
[25] "pad810q"  "pad810u"  "pad820"   "riagendr" "ridagemn" "ridageyr"
[31] "ridexagm" "ridexmon" "ridexprg" "ridreth1" "ridreth3" "ridstatr"
[37] "rxq510"   "rxq515"   "rxq520"   "sddsrvyr" "sdmvpsu"  "sdmvstra"
[43] "seqn"     "wtint2yr" "wtmec2yr"
svymean(~pad680, design=smale, na = TRUE)
         mean     SE
pad680 368.01 7.7944

Try it yourself

  • Subset on agecat; run a chi-square test
data(nhanes)
design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, 
                    nest=TRUE, data=nhanes)
design
Stratified 1 - level Cluster Sampling design (with replacement)
With (31) clusters.
svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
    nest = TRUE, data = nhanes)

t-tests

t-tests: one sample t-test

  • Minutes of sedentary activity
svyttest(pad680~0, nhc, na = TRUE)

    Design-based one-sample t-test

data:  pad680 ~ 0
t = 66.989, df = 14, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 351.8529 375.1287
sample estimates:
    mean 
363.4908 

t-tests: paired samples t-test

  • Aching in mouth and felt bad because of mouth
svymean(~ohq630+ohq620, nhc, na = TRUE)
         mean     SE
ohq630 4.5052 0.0190
ohq620 4.1281 0.0237
svyttest(I(ohq630-ohq620)~0, nhc, na = TRUE)

    Design-based one-sample t-test

data:  I(ohq630 - ohq620) ~ 0
t = 23.107, df = 14, p-value = 1.506e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.3421406 0.4121551
sample estimates:
     mean 
0.3771479 

t-tests: two-sample t-tests

svyttest(ridageyr~female, nhc)

    Design-based t-test

data:  ridageyr ~ female
t = 6.365, df = 14, p-value = 1.754e-05
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 1.148072 2.315009
sample estimates:
difference in mean 
          1.731541 

t-tests continued

  • Calculating the difference by hand
  • The covariance matrix requested in the svyby call (covmat = TRUE) is needed by the svycontrast function
summary(svyglm(ridageyr~female, design=nhc))

Call:
svyglm(formula = ridageyr ~ female, design = nhc)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  38.1088     0.5197  73.327  < 2e-16 ***
female        1.7315     0.2720   6.365 1.75e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 520.9637)

Number of Fisher Scoring iterations: 2

t-tests continued

a <- svyby(~ridageyr, ~female, nhc, na.rm.by = TRUE, svymean, covmat = TRUE)
vcov(a)
          0         1
0 0.2700972 0.2475816
1 0.2475816 0.2990718
a
  female ridageyr        se
0      0 38.10880 0.5197088
1      1 39.84034 0.5468746

t-tests continued

  • Calculating the difference between two means
svycontrast(a, c( -1, 1))
         contrast    SE
contrast   1.7315 0.272
39.84034 - 38.10880
[1] 1.73154

t-tests continued

  • Another example, this time using svypredmeans
summary(svyglm(ridageyr~female, design = nhc))

Call:
svyglm(formula = ridageyr ~ female, design = nhc)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  38.1088     0.5197  73.327  < 2e-16 ***
female        1.7315     0.2720   6.365 1.75e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 520.9637)

Number of Fisher Scoring iterations: 2

Predicted means from t-test

  • Can’t have female in the model when you want to get the predicted means for female
ttest1 <- (svyglm(ridageyr ~1, design = nhc))
summary(ttest1)

Call:
svyglm(formula = ridageyr ~ 1, design = nhc)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  38.9893     0.5149   75.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 521.7131)

Number of Fisher Scoring iterations: 2

Predicted means from t-test continued

svypredmeans(ttest1, ~factor(female))
    mean     SE
0 38.109 0.5197
1 39.840 0.5469

t-test with a subpopulation

svyttest(ridageyr~female, nhc)

    Design-based t-test

data:  ridageyr ~ female
t = 6.365, df = 14, p-value = 1.754e-05
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 1.148072 2.315009
sample estimates:
difference in mean 
          1.731541 

Question for audience

  • What do you call a linear regression with a single dichotomous predictor?

Answer

  • A t-test!
summary(svyglm(pad680~female, design=nhc, na.action = na.omit))

Call:
svyglm(formula = pad680 ~ female, design = nhc, na.action = na.omit)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  368.010      7.794  47.215   <2e-16 ***
female        -8.781      6.102  -1.439    0.172    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 49964.36)

Number of Fisher Scoring iterations: 2
svyttest(pad680~female, subset(nhc, dmqmiliz == 0 ) )

    Design-based t-test

data:  pad680 ~ female
t = -0.90362, df = 14, p-value = 0.3815
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 -22.688225   9.237606
sample estimates:
difference in mean 
          -6.72531 

Linear regression

Multiple linear regression: main effects only

summary(svyglm(pad680~female+pad800, design=nhc, na.action = na.omit))

Call:
svyglm(formula = pad680 ~ female + pad800, design = nhc, na.action = na.omit)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 389.6686    11.6894  33.335 5.61e-14 ***
female      -18.1188     7.7054  -2.351   0.0351 *  
pad800       -0.3272     0.0575  -5.690 7.42e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 46878.07)

Number of Fisher Scoring iterations: 2

Interpretation

  • Interpreted just the same as an unweighted model
  • For a one-unit change in a predictor (e.g., female), the expected change in the outcome is that predictor’s coefficient, holding all other variables in the model constant
  • For example, holding pad800 constant, females report about 18.1 fewer minutes of sedentary activity than males
  • Each coefficient is adjusted for all of the other predictors in the model

Multiple linear regression with an interaction term

summary(svyglm(pad680~female*pad800, design = nhc, na.action = na.omit))

Call:
svyglm(formula = pad680 ~ female * pad800, design = nhc, na.action = na.omit)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   393.25805   13.00529  30.238 1.07e-12 ***
female        -25.72500   10.84002  -2.373 0.035199 *  
pad800         -0.37573    0.07615  -4.934 0.000346 ***
female:pad800   0.12010    0.07645   1.571 0.142160    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 46862.61)

Number of Fisher Scoring iterations: 2

Multiple linear regression with an interaction term continued

glm1 <- (svyglm(pad680~female*pad800, design = nhc, na.action = na.omit))
glm1
Stratified 1 - level Cluster Sampling design (with replacement)
With (30) clusters.
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Call:  svyglm(formula = pad680 ~ female * pad800, design = nhc, na.action = na.omit)

Coefficients:
  (Intercept)         female         pad800  female:pad800  
     393.2581       -25.7250        -0.3757         0.1201  

Degrees of Freedom: 6339 Total (i.e. Null);  12 Residual
  (5593 observations deleted due to missingness)
Null Deviance:      300500000 
Residual Deviance: 297100000    AIC: 86400

Multiple linear regression: getting confidence intervals

confint(glm1)
                     2.5 %      97.5 %
(Intercept)   364.92195664 421.5941442
female        -49.34337506  -2.1066203
pad800         -0.54164749  -0.2098056
female:pad800  -0.04646547   0.2866697

Multiple linear regression: two categorical predictors

summary(svyglm(pad680~factor(female)*factor(dmdhrmaz), design=nhc, 
               na.action = na.omit))

Call:
svyglm(formula = pad680 ~ factor(female) * factor(dmdhrmaz), 
    design = nhc, na.action = na.omit)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        396.866     17.316  22.919 5.64e-10 ***
factor(female)1                    -17.001     16.670  -1.020   0.3318    
factor(dmdhrmaz)2                  -70.221     34.156  -2.056   0.0668 .  
factor(dmdhrmaz)3                   -5.949     49.498  -0.120   0.9067    
factor(female)1:factor(dmdhrmaz)2   57.922     49.112   1.179   0.2655    
factor(female)1:factor(dmdhrmaz)3  -63.635     63.464  -1.003   0.3397    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 25424.33)

Number of Fisher Scoring iterations: 2

AIC and BIC

AIC(glm1, null_has_intercept = TRUE)
       eff.p          AIC     deltabar 
    6.362728 99751.045451     1.590682 
BIC(glm1, maximal = glm1)
        p       BIC      neff 
        4 297062151       NaN 

Predicted margins

ols1 <- (svyglm(pad680~1, design=nhc, na.action = na.omit))
predmarg<-svypredmeans(ols1, ~interaction(female,dmdhrmaz))
predmarg
      mean     SE
0.1 396.87 17.316
1.1 379.86 18.002
0.2 326.64 26.614
1.3 310.28 47.484
1.2 367.57 39.700
0.3 390.92 47.087

Non-parametric tests

Non-parametric tests: Wilcoxon signed rank test

  • Non-parametric version of a matched pairs t-test
wil <- svyranktest(ohq630~ohq620, design = nhc, na = TRUE, test = c("wilcoxon"))
wil

    Design-based KruskalWallis test

data:  ohq630 ~ ohq620
df = 4, Chisq = 1811.4, p-value = 4.026e-12

Non-parametric tests: Median test

  • Non-parametric version of an independent samples t-test
mtest <- svyranktest(pad680~female, design = nhc, na = TRUE, test=("median"))
mtest

    Design-based median test

data:  pad680 ~ female
t = -0.34042, df = 14, p-value = 0.7386
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                 -0.005688636 

Non-parametric tests: Kruskal Wallis test

  • Non-parametric version of an ANOVA
kwtest <- svyranktest(ohq620~dmdmartz, design = nhc, na = TRUE, 
                      test=("KruskalWallis"))
kwtest

    Design-based KruskalWallis test

data:  ohq620 ~ dmdmartz
df = 4, Chisq = 24.281, p-value = 0.007872

Logistic regression

Create a binary variable

  • Creating a binary outcome variable
# tghealth: 1 = teeth/gums rated excellent or very good; 0 = good, fair, or poor
nhanes2021$tghealth = NA
nhanes2021$tghealth[nhanes2021$ohq845 == 1] = 1
nhanes2021$tghealth[nhanes2021$ohq845 == 2] = 1
nhanes2021$tghealth[nhanes2021$ohq845 == 3] = 0
nhanes2021$tghealth[nhanes2021$ohq845 == 4] = 0
nhanes2021$tghealth[nhanes2021$ohq845 == 5] = 0
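
The same recode can be done in one step (an equivalent alternative, not required):

# 1-2 (excellent/very good) -> 1; 3-5 (good/fair/poor) -> 0; anything else stays NA
nhanes2021$tghealth <- ifelse(nhanes2021$ohq845 %in% 1:2, 1,
                              ifelse(nhanes2021$ohq845 %in% 3:5, 0, NA))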

Need to reissue the svydesign command

  • The design object stores its own copy of the data, so new or recoded variables are not available until the design is recreated

nhc1 <- svydesign(id=~sdmvpsu, weights=~wtint2yr,strata=~sdmvstra, nest=TRUE, 
                 survey.lonely.psu = "adjust", data=nhanes2021)

Logistic regression

logit1 <- (svyglm(tghealth~factor(dmdeduc2)+ridageyr, family=quasibinomial, 
                  design=nhc1, na.action = na.omit))
summary(logit1)

Call:
svyglm(formula = tghealth ~ factor(dmdeduc2) + ridageyr, design = nhc1, 
    family = quasibinomial, na.action = na.omit)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.250701   0.220296  -5.677 0.000205 ***
factor(dmdeduc2)2  0.095245   0.218273   0.436 0.671848    
factor(dmdeduc2)3  0.621578   0.185369   3.353 0.007326 ** 
factor(dmdeduc2)4  0.975127   0.173591   5.617 0.000222 ***
factor(dmdeduc2)5  1.795251   0.193574   9.274 3.16e-06 ***
ridageyr          -0.005801   0.001958  -2.963 0.014219 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 1.147022)

Number of Fisher Scoring iterations: 4

Interpretation

  • Interpreted just the same as an unweighted model
  • For a one-unit change in a predictor (e.g., ridageyr), the expected change in the predicted log odds of the outcome is that predictor’s coefficient, holding all other variables in the model constant
  • Each coefficient is adjusted for all of the other predictors in the model; exponentiating gives odds ratios (see the sketch below)
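
Since the coefficients are on the log-odds scale, one way to obtain odds ratios with confidence intervals (a standard transformation, not part of the output above) is:

# exponentiate the coefficients and confidence limits to get odds ratios
exp(cbind(OR = coef(logit1), confint(logit1)))

For example, exp(0.975) ≈ 2.65 for dmdeduc2 level 4: the odds of rating teeth and gums as very good or excellent are about 2.7 times those of the lowest education group, holding age constant.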

Logistic regression using a subpopulation

subset1 <- subset(nhc1, ridageyr > 20)
logit2 <- (svyglm(tghealth~factor(dmdeduc2)+ridageyr, 
                  family=quasibinomial, design=subset1, na.action = na.omit))
summary(logit2)

Call:
svyglm(formula = tghealth ~ factor(dmdeduc2) + ridageyr, design = subset1, 
    family = quasibinomial, na.action = na.omit)

Survey design:
subset(nhc1, ridageyr > 20)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.285928   0.217597  -5.910 0.000149 ***
factor(dmdeduc2)2  0.074280   0.207288   0.358 0.727538    
factor(dmdeduc2)3  0.601982   0.191304   3.147 0.010391 *  
factor(dmdeduc2)4  0.974348   0.175175   5.562 0.000240 ***
factor(dmdeduc2)5  1.801740   0.191866   9.391 2.82e-06 ***
ridageyr          -0.005143   0.001891  -2.720 0.021576 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 1.001688)

Number of Fisher Scoring iterations: 4

Logistic regression: getting pseudo-R-square values

psrsq(logit2, method = c("Cox-Snell"))
[1] 0.07726501
psrsq(logit2, method = c("Nagelkerke"))
[1] 0.1046439

Try it yourself

  • Use HI_CHOL as the outcome and any of the other variables as predictors
data(nhanes)
design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, 
                    nest=TRUE, data=nhanes)
design
Stratified 1 - level Cluster Sampling design (with replacement)
With (31) clusters.
svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
    nest = TRUE, data = nhanes)

Ordered logistic

ologit1 <- svyolr(factor(dmdeduc2)~factor(female)+factor(dmdborn4)+pad680, 
                  design = nhc, method = c("logistic"))
summary(ologit1)
Call:
svyolr(factor(dmdeduc2) ~ factor(female) + factor(dmdborn4) + 
    pad680, design = nhc, method = c("logistic"))

Coefficients:
                        Value   Std. Error   t value
factor(female)1   0.169926364 0.0472898498  3.593295
factor(dmdborn4)1 0.234984893 0.1373352880  1.711031
pad680            0.001808744 0.0001728276 10.465596

Intercepts:
    Value    Std. Error t value 
1|2  -2.4906   0.2214   -11.2488
2|3  -1.3229   0.1911    -6.9208
3|4   0.3635   0.1564     2.3249
4|5   1.6192   0.1773     9.1339
(4226 observations deleted due to missingness)
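
  • For example, exp(0.170) ≈ 1.19: holding the other predictors constant, the odds of being in a higher education category are about 19% greater for females than for males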

Count model

Poisson model

summary(svyglm(pad800~female, design=nhc, family=poisson()))

Call:
svyglm(formula = pad800 ~ female, design = nhc, family = poisson())

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.30368    0.02881 149.376  < 2e-16 ***
female      -0.27657    0.02811  -9.839 1.14e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 67.8306)

Number of Fisher Scoring iterations: 5
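
Exponentiated Poisson coefficients can be read as rate ratios; for example, exp(-0.277) ≈ 0.76, i.e., females report about 24% fewer minutes of moderate LTPA than males. A minimal sketch:

# refit the model and exponentiate the coefficients to get rate ratios
pois1 <- svyglm(pad800 ~ female, design = nhc, family = poisson())
exp(coef(pois1))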

Other things

Weighted Principal Components Analysis

pc <- svyprcomp(~pad680+ohq620+ohq630, design=nhc,scale=TRUE,scores=TRUE)
pc
Standard deviations (1, .., p=3):
[1] 1.4538469 0.8541645 0.3958942

Rotation (n x k) = (3 x 3):
              PC1        PC2        PC3
pad680 -0.4433069 -0.8948138 -0.0527938
ohq620 -0.6281802  0.3521469 -0.6938171
ohq630 -0.6394283  0.2744099  0.7182135

PCA plot

biplot(pc, weight="scaled")

Cronbach’s alpha

svycralpha(~dmdhrmaz+dmdborn4, design=nhc, na.rm = TRUE)
   *alpha* 
0.06773242 

Hot off the press!

  • Thomas Lumley’s newest package: svyVGAM (design-based inference for Vector Generalised Linear Models)
    • https://cran.r-project.org/web/packages/svyVGAM/index.html
    • proportional odds model (ordinal logistic regression)
    • different algorithm than used by svyolr()
    • zero-inflated Poisson
    • multinomial logistic regression

Wrapping up

More information on NHANES data

  • Helpful resources
    • Listserv: http://www.cdc.gov/nchs/nhanes/nhanes_listserv.htm
    • Online tutorials: http://www.cdc.gov/nchs/tutorials/index.htm
    • NHANES data are also available from IPUMS

Questions????

[slide asking for questions]

References

  • Applied Survey Data Analysis by Steven G. Heeringa, Brady T. West, and Patricia A. Berglund
  • Analysis of Health Surveys by Edward L. Korn and Barry I. Graubard
  • Sampling of Populations: Methods and Applications, Fourth Edition by Paul Levy and Stanley Lemeshow
  • Analysis of Survey Data Edited by R. L. Chambers and C. J. Skinner

References continued

  • Sampling Techniques, Third Edition by William G. Cochran. 1977.
  • Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., and Rasbash, J. (1998), Weighting for Unequal Selection Probabilities in Multilevel Models, Journal of the Royal Statistical Society, Series B, 60, 23-40.
  • Rabe-Hesketh, S. and Skrondal, A. (2006). Multilevel Modelling of Complex Survey Data, Journal of the Royal Statistical Society, Series A, 169, 805-827.

References continued

  • Carle, Adam C. (2009). Fitting Multilevel Models in Complex Survey Data with Design Weights: Recommendations. BMC Medical Research Methodology; 9(49).
  • Quartagno, M., Carpenter, R., and Goldstein, H. (2019). Multiple Imputation with Survey Weights: A Multilevel Approach. Journal of Survey Statistics and Methodology, Volume 8, Issue 5, November 2020, Pages 965–989, https://doi.org/10.1093/jssam/smz036

Thank you!!

[slide thanking audience for attending]