Applied Survey Data Analysis with R

Christine R. Wells, Ph.D.

Statistical Methods and Data Analytics

UCLA Office of Advanced Research Computing

Introduction

About this workshop

  • Assumes basic familiarity with R
  • Assumes basic knowledge of data analysis
  • Introduction to complex survey data and how they are analyzed
  • R code and output are provided
  • Ask questions

Different kinds of data collection methods

  • Experiment: assume simple random sample (SRS) from a known population and equal weighting
  • Quasi-experiment: assume SRS
  • Online questionnaire (e.g., SurveyMonkey, REDCap): convenience sample; weighting unknown (and possibly unknowable)
  • Complex survey data: not SRS, but possible to calculate sampling weight

Question for audience

  • Why not use simple random sampling?

How the data are collected

  • Sampling frame: listing of all elements of a population
  • First step: stratify or cluster?
    • Stratification: each element belongs to exactly one stratum
  • Can use different sampling methods in each stratum

Drilling down

  • Primary sampling unit (PSU)
  • Secondary sampling unit (SSU)
  • Tertiary sampling unit (TSU)
  • Why are the SSU and TSU not included in the syntax/math?

More about selecting respondents

  • Over- and under-sampling
  • Generalization (know the population, e.g., NHANES)
  • Unit non-response (a big problem)
  • Sampling with and without replacement
  • Difference between complex survey data and data from a convenience sample: the sampling probability is known

Sampling weight

  • Difference between sampling weight and probability weight
  • Definition of probability weight: N/n
    • N = population size
    • n = sample size
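  • Example (hypothetical numbers): N = 10,000 and n = 500 gives a weight of 10,000 / 500 = 20, so each sampled element represents 20 elements of the population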

Sampling weight continued

  • Multiply the probability weights at each stage
  • Corrections: unit non-response, trimming, raking, etc.
  • Sum to population size
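  • Example (hypothetical numbers): a PSU selected with probability 1/50 and a person selected within that PSU with probability 1/20 gives an overall base weight of 50 × 20 = 1,000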

Stratification

  • Purpose of stratification: theoretical and practical
  • Geographical and demographic
  • Must know some information about every element
  • Most effective when the variance of the dependent variable is smaller within the strata than in the sample as a whole
  • Textbook examples (stratification variable related to outcome variable) versus real data

Clustering

  • Clustering: a necessary evil
  • How many PSUs per stratum?
  • Singleton PSUs, consequences, and how to handle them

Denominator degrees of freedom

  • Calculated as (number of PSUs) - (number of strata)
  • For example, 30 PSUs - 15 strata = 15 (the NHANES design used in the examples below has 30 PSUs in 15 strata)

Effects of survey elements

  • Changes the math used to calculate a value/quantity
  • Sampling weight: point estimate (e.g., mean, coefficient)
  • PSU and strata: variance estimates
    (e.g., standard error and CI)
  • PSU and strata: theory versus reality

Effects of not using survey elements

  • No sampling weight: point estimates do not represent the population
  • No strata and/or PSU: standard errors are artificially small
    • Leads to Type I errors
  • Cannot generalize beyond the sample

Sampling with and without replacement

  • Textbooks and math distinguish between with and without replacement
  • Real world: never sample with replacement
  • Math used in software: almost always assumes with replacement
    • Why?

Questions for audience

  1. What is the sampling weight used for?
  2. What is the consequence of ignoring the sampling weight?
  3. The PSU information is used in the calculation of what?

NHANES data

Getting data

  • https://wwwn.cdc.gov/nchs/nhanes/default.aspx
  • Read the documentation!
  • Get information on sampling, variables, etc.
  • Don’t need to read everything - be (a little) selective

What to read

  • Introduction
  • Sample design and analysis guidelines
  • Variance estimation
  • Missing data/imputation
  • Variable-specific documentation
  • Data user agreement

Using NHANES

  • Different data collection modes
    • Questionnaire
    • Exam
    • Lab
    • Limited access (not discussed here)
  • Each has a different sampling weight
  • Use the sampling weight associated with the data collection mode that has the smallest N
    • Called the “least common denominator” (e.g., when combining interview and exam variables, use the exam (MEC) weight, because the examined subsample is smaller)

Merge versus append

  • Merge when combining different datasets from the same cycle (use seqn; see the sketch below)
  • Append when combining datasets from different cycles
    • Correct sampling weight
    • Include variable coding for year or cycle
    • Pay attention to documentation regarding pandemic
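
A minimal sketch of both operations (the data frame names demo, ohq, demo_a, and demo_b are hypothetical placeholders; the exact weight adjustment depends on which cycles are combined, so check the documentation):

# merge two files from the same cycle on the respondent ID
onecycle <- merge(demo, ohq, by = "seqn")

# append the same file from two cycles; add a cycle indicator first
demo_a$cycle <- 1
demo_b$cycle <- 2
twocycles <- rbind(demo_a, demo_b)

# when combining cycles, build the combined weight per the documentation
# (e.g., dividing each 2-year weight by the number of cycles combined)
twocycles$wtcombined <- twocycles$wtint2yr / 2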

Where to get help?

  • https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/faq.aspx
  • https://wwwn.cdc.gov/nchs/nhanes/tutorials/default.aspx

R code

Libraries

library("haven")
library("survey")
library("jtools")
library("remotes")

Reading in the data

  • Read in your data file
  • NHANES data are distributed from the NHANES website as SAS transport (.xpt) files, which can be read with haven::read_xpt() (see the sketch below)
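
For example, using the haven package loaded above (the file name DEMO_L.xpt and the object name demo are placeholders for whichever file you downloaded):

# read a SAS transport (.xpt) file downloaded from the NHANES website
demo <- read_xpt("DEMO_L.xpt")

# NHANES variable names arrive in upper case; lower-casing them matches the code below
names(demo) <- tolower(names(demo))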

Specifying the survey elements

nhc <- svydesign(id=~sdmvpsu, weights=~wtint2yr, strata=~sdmvstra, nest=TRUE, 
                 survey.lonely.psu = "adjust", data=nhanes2021)

Specifying the survey elements continued

summary(nhc)
Stratified 1 - level Cluster Sampling design (with replacement)
With (30) clusters.
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)
Probabilities:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
5.849e-06 2.956e-05 4.615e-05 5.219e-05 6.978e-05 2.181e-04 
Stratum Sizes: 
           173 174 175 176 177 178 179 180 181 182 183 184 185 186 187
obs        837 832 790 744 832 797 821 797 799 799 696 957 774 676 782
design.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
actual.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
Data variables:
 [1] "dmdborn4" "dmdeduc2" "dmdhhsiz" "dmdhragz" "dmdhredz" "dmdhrgnd"
 [7] "dmdhrmaz" "dmdhsedz" "dmdmartz" "dmdyrusr" "dmqmiliz" "female"  
[13] "indfmpir" "ohq620"   "ohq630"   "ohq640"   "ohq660"   "ohq670"  
[19] "ohq680"   "ohq845"   "pad680"   "pad790q"  "pad790u"  "pad800"  
[25] "pad810q"  "pad810u"  "pad820"   "riagendr" "ridagemn" "ridageyr"
[31] "ridexagm" "ridexmon" "ridexprg" "ridreth1" "ridreth3" "ridstatr"
[37] "rxq510"   "rxq515"   "rxq520"   "sddsrvyr" "sdmvpsu"  "sdmvstra"
[43] "seqn"     "wtint2yr" "wtmec2yr"

Working with data before analysis

  • Check the mean, minimum, and maximum of the sampling weight variable
  • Check for missing values on the weight, PSU, and strata variables
  • Do data management tasks (e.g., create new variables, recode variables); a sketch follows below
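
A minimal sketch of these checks, using the variable names from the nhanes2021 data used throughout:

# distribution of the sampling weight (should be positive, with no missing values)
summary(nhanes2021$wtint2yr)

# count missing values on the weight, PSU, and strata variables
colSums(is.na(nhanes2021[, c("wtint2yr", "sdmvpsu", "sdmvstra")]))

# example data management: create an indicator before (re)building the design object
nhanes2021$age65 <- as.numeric(nhanes2021$ridageyr >= 65)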

The variables used in examples

  • ridageyr: Age in years; top coded at 80
  • female: 0 = male; 1 = female
  • dmdborn4: 0 = born elsewhere; 1 = born in US
  • dmdeduc2: 1 = less than 9th grade; 2 = no HS diploma; 3 = HS grad or GED; 4 = some college or AA degree; 5 = college grad or above
  • dmqmiliz: 0 = did not serve active duty in US Armed Forces; 1 = did serve active duty in US Armed Forces

The variables used in examples continued

  • dmdhrmaz: 1 = married; 2 = widowed/divorced; 3 = never married
  • ridexprg: 0 = not pregnant at exam; 1 = pregnant at exam
  • ohq620: How often last year had aching in mouth? 1 = very often; 2 = fairly often; 3 = occasionally; 4 = hardly ever; 5 = never
  • ohq630: How often felt bad because of mouth? 1 = very often; 2 = fairly often; 3 = occasionally; 4 = hardly ever; 5 = never
  • ohq845: Rate the health of your teeth and gums: 1 = excellent; 2 = very good; 3 = good; 4 = fair; 5 = poor

The variables used in examples continued

  • pad680: Minutes sedentary activity: 0 - 1380
  • pad800: Minutes moderate leisure time physical activity (LTPA): 1 - 720
  • pad820: Minutes vigorous leisure time physical activity (LTPA): 1 - 900
  • rxq510: Dr. told you to take daily low-dose aspirin? 0 = no; 1 = yes

Weighting in examples

  • Only demographic and interview (questionnaire) data are used in these examples
  • No need to change the weight variable
  • Not necessarily true for other analyses

Descriptive statistics

Descriptive statistics with continuous variables

  • Calculating means (documentation for svymean is found under surveysummary)
svymean(~ridageyr, design = nhc)
           mean     SE
ridageyr 38.989 0.5149
  • Calculating standard deviations
svysd(~ridageyr,design = nhc, na = TRUE)
         std. dev.
ridageyr    22.841

Means with missing data

sum(is.na(nhanes2021$ridageyr))
[1] 0
svymean(~ridageyr, nhc, na = TRUE)
           mean     SE
ridageyr 38.989 0.5149
sum(is.na(nhanes2021$ohq620))
[1] 204
svymean(~ohq620, nhc, na = TRUE)
         mean   SE
ohq620 4.1774 0.02
sum(is.na(nhanes2021$pad680))
[1] 3868
svymean(~pad680, nhc, na = TRUE)
         mean     SE
pad680 363.49 5.4261

Listwise deletion of missing cases across all variables

  • Notice that the means shown below do not match those seen before for ridageyr and ohq620
  • Before: ridageyr = 38.99; ohq620 = 4.18; pad680 = 363.49
svymean(~ridageyr+ohq620+pad680, nhc, na = TRUE)
            mean     SE
ridageyr  47.711 0.4690
ohq620     4.126 0.0238
pad680   363.489 5.4395

A note about missing data

  • Do not delete cases with missing data!
  • Redefines the population in an unknowable way

Getting the mean, standard deviation and variance

svymean(~pad680, nhc, na = TRUE)
         mean     SE
pad680 363.49 5.4261
svysd(~pad680, design = nhc, na = TRUE)
       std. dev.
pad680    209.83
svyvar(~pad680, design = nhc, na = TRUE)
       variance     SE
pad680    44029 1321.9

Definition of Coefficient of Variation (CV)

  • Coefficient of variation: the ratio of the standard error of an estimate to the estimate itself, often expressed as a percentage
  • A measure of relative precision that does not depend on the unit of measurement of the variable
  • Note that the survey package’s cv() function returns the raw ratio (not multiplied by 100)

Definition of Design Effect (DEFF)

  • Ratio of two variances
  • Numerator: variance estimate from the current sample (including all of its design elements)
  • Denominator: variance from a hypothetical sample of the same size drawn as an SRS

Definition of Design Effect (DEFF) continued

  • DEFF indicates how efficient your sample is compared to an SRS of equal size
  • DEFF less than 1: sample is more efficient than SRS
  • DEFF is almost always greater than 1 (due to clustering)

Getting the CV and DEFF

cv(svymean(~pad680, design = nhc, na = TRUE))
           pad680
pad680 0.01492782
svymean(~pad680, nhc, na = TRUE, deff = TRUE)
           mean       SE   DEff
pad680 363.4908   5.4261 5.3934
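
  • As a check, 5.4261 / 363.4908 ≈ 0.0149, which matches the cv() output above (about 1.5% expressed as a percentage)
  • A DEFF of 5.39 means the design-based variance of this mean is roughly 5.4 times what an SRS of the same size would give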

Getting quantiles

svyquantile(~ridageyr, design = nhc, na = TRUE, c(.25,.5,.75),ci=TRUE)
$ridageyr
     quantile ci.2.5 ci.97.5        se
0.25       19     18      21 0.7037464
0.5        38     37      40 0.7037464
0.75       58     57      60 0.7037464

attr(,"hasci")
[1] TRUE
attr(,"class")
[1] "newsvyquantile"

Getting totals

svytotal(~dmdborn4,design = nhc, na = TRUE)
             total       SE
dmdborn4 273897708 13781540

Getting the CV and DEFF for totals

cv(svytotal(~dmdborn4,design = nhc, na = TRUE))
           dmdborn4
dmdborn4 0.05031637
svytotal(~dmdborn4,design = nhc, na = TRUE, deff = TRUE)
             total        SE   DEff
dmdborn4 273897708  13781540 156.83

Means and proportions for binary variables

svymean(~female, nhc)
          mean     SE
female 0.50851 0.0068

Getting confidence intervals

  • Per the documentation, use svyciprop() rather than confint()
  • There are many options for method
  • method = "likelihood" uses the (Rao-Scott) scaled chi-squared distribution for the loglikelihood from a binomial distribution
svyciprop(~I(female==1), nhc, method="likelihood")
                      2.5% 97.5%
I(female == 1) 0.509 0.494 0.523

Getting confidence intervals continued

  • li is short for likelihood
svyciprop(~I(female==0), nhc, method="li")
                      2.5% 97.5%
I(female == 0) 0.491 0.477 0.506

More on getting confidence intervals continued

  • Fits a logistic regression model and computes a Wald-type interval on the log-odds scale, which is then transformed to the probability scale
svyciprop(~I(female==1), nhc, method="logit")
                      2.5% 97.5%
I(female == 1) 0.509 0.494 0.523

More on getting confidence intervals continued

  • Uses a logit transformation of the mean and then back-transforms to the probability scale
  • Method used by SUDAAN and SPSS COMPLEX SAMPLES
svyciprop(~I(female==1), nhc, method="xlogit")
                      2.5% 97.5%
I(female == 1) 0.509 0.494 0.523

Descriptive statistics for categorical variables

  • Frequencies
svytable(~female, design = nhc)
female
        0         1 
160722480 166291071 

Crosstabulations: 2-way tables

  • Method 1
svytable(~female+dmdborn4, nhc)
      dmdborn4
female         0         1
     0  25904267 134595325
     1  26772926 139302383

Crosstabulations: 2-way tables

  • Method 2
svytable(~interaction(female, dmdborn4), design = nhc)
interaction(female, dmdborn4)
      0.0       1.0       0.1       1.1 
 25904267  26772926 134595325 139302383 

Crosstabulations: 3-way table

svytable(~interaction(female, dmdborn4, rxq510), design = nhc)
interaction(female, dmdborn4, rxq510)
   0.0.0    1.0.0    0.1.0    1.1.0    0.0.1    1.0.1    0.1.1    1.1.1 
11733993 12611137 36705490 46617656  3526721  3259081 23365237 20067392 

Crosstabulations: 4-way table

  • Getting very difficult to read
svytable(~interaction(female, dmdborn4, ridexprg, dmdeduc2), design = nhc)
interaction(female, dmdborn4, ridexprg, dmdeduc2)
    0.0.0.1     1.0.0.1     0.1.0.1     1.1.0.1     0.0.1.1     1.0.1.1 
       0.00   824620.69        0.00    43684.05        0.00        0.00 
    0.1.1.1     1.1.1.1     0.0.0.2     1.0.0.2     0.1.0.2     1.1.0.2 
       0.00        0.00        0.00   861261.38        0.00  1122922.04 
    0.0.1.2     1.0.1.2     0.1.1.2     1.1.1.2     0.0.0.3     1.0.0.3 
       0.00        0.00        0.00    84449.45        0.00  2060327.67 
    0.1.0.3     1.1.0.3     0.0.1.3     1.0.1.3     0.1.1.3     1.1.1.3 
       0.00  5732701.94        0.00   146333.46        0.00   435114.57 
    0.0.0.4     1.0.0.4     0.1.0.4     1.1.0.4     0.0.1.4     1.0.1.4 
       0.00  1380815.39        0.00 10247378.16        0.00    71985.69 
    0.1.1.4     1.1.1.4     0.0.0.5     1.0.0.5     0.1.0.5     1.1.0.5 
       0.00   236404.64        0.00  3591233.65        0.00 13480603.96 
    0.0.1.5     1.0.1.5     0.1.1.5     1.1.1.5 
       0.00    59540.46        0.00   522743.95 

Chi-square test

  • 2 by 2
svychisq(~female+dmdborn4, nhc, statistic="adjWald")

    Design-based Wald test of association

data:  svychisq(~female + dmdborn4, nhc, statistic = "adjWald")
F = 0.00052674, ndf = 1, ddf = 15, p-value = 0.982

Chi-square test continued

  • 5 by 2
svychisq(~dmdeduc2+dmqmiliz, nhc, statistic="adjWald")

    Design-based Wald test of association

data:  svychisq(~dmdeduc2 + dmqmiliz, nhc, statistic = "adjWald")
F = 11.015, ndf = 4, ddf = 12, p-value = 0.0005499

Try it yourself

  • Run the following code, and then try a few descriptive analyses (mostly categorical)
data(nhanes)
design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, 
                    nest=TRUE, data=nhanes)
design
Stratified 1 - level Cluster Sampling design (with replacement)
With (31) clusters.
svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
    nest = TRUE, data = nhanes)

Graphing

Graphing of continuous variables: Histogram

  • Default is to show density on y-axis
svyhist(~pad680, nhc)

histogram of pad680 with density on y-axis

Histogram with count on y-axis

svyhist(~pad680, nhc, probability = FALSE)

histogram of pad680 with counts on the y-axis

Boxplot

svyboxplot(~ridageyr~1, nhc, all.outliers=TRUE)

boxplot of the age variable

Boxplot broken into categories

  • The grouping variable must be a factor
svyboxplot(~ridageyr~factor(female), nhc, all.outliers=TRUE)

boxplot of age by sex

Barchart

barplt<-svyby(~ohq620+ohq630, ~female, nhc, na = TRUE, svymean)
barplot(barplt,beside=TRUE,legend=TRUE)

barplot of ohq620 and ohq630

Dotplot

dotchart(barplt)

barplot recast as a dotchart

Scatterplot with weights as bubble size

svyplot(pad680~pad820, design = nhc, style="bubble")

scatterplot of pad680 by pad820 with bubbles sized by sampling weight

Five minute break

[sleeping kitten indicating break time]

Subpopulations

Analyzing only some cases

  • Analysis of survey data and experimental data is quite different
  • Cannot delete cases from a weighted dataset
  • Cases in the subpopulation are used to calculate point estimate
  • All cases are used to calculate the standard error
  • Can create a binary variable to specify the subpopulation (see the sketch below)
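
A minimal sketch of the indicator-variable approach (the 65-and-over cutoff is just an illustration):

# add a subpopulation indicator to the design object
nhc2 <- update(nhc, senior = as.numeric(ridageyr >= 65))

# the point estimate uses only the subpopulation; the full design is retained for the SEs
svymean(~pad680, subset(nhc2, senior == 1), na = TRUE)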

Examples with minutes of sedentary activity

  • No subpopulation
svymean(~pad680, nhc, na=TRUE)
         mean     SE
pad680 363.49 5.4261
  • Mean of pad680 broken out by the variable female
svyby(~pad680, ~female, nhc, svymean, na=TRUE)
  female  pad680       se
0      0 368.010 7.794388
1      1 359.229 4.203481

Using subset.survey.design to see the results for only one group

svymean(~pad680, subset(nhc, female == 1), na=TRUE)
         mean     SE
pad680 359.23 4.2035

Mean of pad680 by two variables

  • By low-dose aspirin and female
svyby(~pad680, ~rxq510+female, nhc, svymean, na=TRUE)
    rxq510 female   pad680        se
0.0      0      0 348.3941 10.147146
1.0      1      0 384.2072  8.794646
0.1      0      1 349.9134  6.862274
1.1      1      1 369.0892  9.403969

Three variables to define the subpopulation

  • Need the “na = TRUE”
svyby(~pad680, ~dmqmiliz+rxq510+female, nhc, na = TRUE, svymean)
      dmqmiliz rxq510 female   pad680        se
0.0.0        0      0      0 345.0032 11.938892
1.0.0        1      0      0 366.2002 16.668693
0.1.0        0      1      0 377.0661  9.220377
1.1.0        1      1      0 404.9886 13.913967
0.0.1        0      0      1 348.9575  7.288462
1.0.1        1      0      1 385.2752 35.287306
0.1.1        0      1      1 369.0723  9.474665
1.1.1        1      1      1 369.9402 30.816511

Subsetting to include only males

smale <- subset(nhc,female == 0)
summary(smale)
Stratified 1 - level Cluster Sampling design (with replacement)
With (30) clusters.
subset(nhc, female == 0)
Probabilities:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
5.849e-06 2.822e-05 4.460e-05 5.073e-05 6.848e-05 2.181e-04 
Stratum Sizes: 
           173 174 175 176 177 178 179 180 181 182 183 184 185 186 187
obs        381 416 364 358 381 391 405 377 372 359 344 430 343 311 343
design.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
actual.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
Data variables:
 [1] "dmdborn4" "dmdeduc2" "dmdhhsiz" "dmdhragz" "dmdhredz" "dmdhrgnd"
 [7] "dmdhrmaz" "dmdhsedz" "dmdmartz" "dmdyrusr" "dmqmiliz" "female"  
[13] "indfmpir" "ohq620"   "ohq630"   "ohq640"   "ohq660"   "ohq670"  
[19] "ohq680"   "ohq845"   "pad680"   "pad790q"  "pad790u"  "pad800"  
[25] "pad810q"  "pad810u"  "pad820"   "riagendr" "ridagemn" "ridageyr"
[31] "ridexagm" "ridexmon" "ridexprg" "ridreth1" "ridreth3" "ridstatr"
[37] "rxq510"   "rxq515"   "rxq520"   "sddsrvyr" "sdmvpsu"  "sdmvstra"
[43] "seqn"     "wtint2yr" "wtmec2yr"
svymean(~pad680, design=smale, na = TRUE)
         mean     SE
pad680 368.01 7.7944

Try it yourself

  • Subset on agecat; run a chi-square test
data(nhanes)
design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, 
                    nest=TRUE, data=nhanes)
design
Stratified 1 - level Cluster Sampling design (with replacement)
With (31) clusters.
svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
    nest = TRUE, data = nhanes)

t-tests

t-tests: one sample t-test

  • Minutes of sedentary activity
svyttest(pad680~0, nhc, na = TRUE)

    Design-based one-sample t-test

data:  pad680 ~ 0
t = 66.989, df = 14, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 351.8529 375.1287
sample estimates:
    mean 
363.4908 

t-tests: paired samples t-test

  • Aching in mouth and felt bad because of mouth
svymean(~ohq630+ohq620, nhc, na = TRUE)
         mean     SE
ohq630 4.5052 0.0190
ohq620 4.1281 0.0237
svyttest(I(ohq630-ohq620)~0, nhc, na = TRUE)

    Design-based one-sample t-test

data:  I(ohq630 - ohq620) ~ 0
t = 23.107, df = 14, p-value = 1.506e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.3421406 0.4121551
sample estimates:
     mean 
0.3771479 

t-tests: two-sample t-tests

svyttest(ridageyr~female, nhc)

    Design-based t-test

data:  ridageyr ~ female
t = 6.365, df = 14, p-value = 1.754e-05
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 1.148072 2.315009
sample estimates:
difference in mean 
          1.731541 

t-tests continued

  • Calculating the difference by hand
  • The covariance matrix requested in the svyby call (covmat = TRUE) is needed by the svycontrast function
summary(svyglm(ridageyr~female, design=nhc))

Call:
svyglm(formula = ridageyr ~ female, design = nhc)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  38.1088     0.5197  73.327  < 2e-16 ***
female        1.7315     0.2720   6.365 1.75e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 520.9637)

Number of Fisher Scoring iterations: 2

t-tests continued

a <- svyby(~ridageyr, ~female, nhc, na.rm.by = TRUE, svymean, covmat = TRUE)
vcov(a)
          0         1
0 0.2700972 0.2475816
1 0.2475816 0.2990718
a
  female ridageyr        se
0      0 38.10880 0.5197088
1      1 39.84034 0.5468746

t-tests continued

  • Calculating the difference between two means
svycontrast(a, c( -1, 1))
         contrast    SE
contrast   1.7315 0.272
39.84034 - 38.10880
[1] 1.73154

t-tests continued

  • Another example, this time using svypredmeans
summary(svyglm(ridageyr~female, design = nhc))

Call:
svyglm(formula = ridageyr ~ female, design = nhc)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  38.1088     0.5197  73.327  < 2e-16 ***
female        1.7315     0.2720   6.365 1.75e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 520.9637)

Number of Fisher Scoring iterations: 2

Predicted means from t-test

  • Can’t have female in the model when you want to get the predicted means for female
ttest1 <- (svyglm(ridageyr ~1, design = nhc))
summary(ttest1)

Call:
svyglm(formula = ridageyr ~ 1, design = nhc)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  38.9893     0.5149   75.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 521.7131)

Number of Fisher Scoring iterations: 2

Predicted means from t-test continued

svypredmeans(ttest1, ~factor(female))
    mean     SE
0 38.109 0.5197
1 39.840 0.5469

t-test with a subpopulation

svyttest(ridageyr~female, nhc)

    Design-based t-test

data:  ridageyr ~ female
t = 6.365, df = 14, p-value = 1.754e-05
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 1.148072 2.315009
sample estimates:
difference in mean 
          1.731541 

Question for audience

  • What do you call a linear regression with a single dichotomous predictor?

Answer

  • A t-test!
summary(svyglm(pad680~female, design=nhc, na.action = na.omit))

Call:
svyglm(formula = pad680 ~ female, design = nhc, na.action = na.omit)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  368.010      7.794  47.215   <2e-16 ***
female        -8.781      6.102  -1.439    0.172    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 49964.36)

Number of Fisher Scoring iterations: 2
svyttest(pad680~female, subset(nhc, dmqmiliz == 0 ) )

    Design-based t-test

data:  pad680 ~ female
t = -0.90362, df = 14, p-value = 0.3815
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 -22.688225   9.237606
sample estimates:
difference in mean 
          -6.72531 

Linear regression

Multiple linear regression: main effects only

summary(svyglm(pad680~female+pad800, design=nhc, na.action = na.omit))

Call:
svyglm(formula = pad680 ~ female + pad800, design = nhc, na.action = na.omit)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 389.6686    11.6894  33.335 5.61e-14 ***
female      -18.1188     7.7054  -2.351   0.0351 *  
pad800       -0.3272     0.0575  -5.690 7.42e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 46878.07)

Number of Fisher Scoring iterations: 2

Interpretation

  • Interpreted just the same as an unweighted model
  • For a one-unit change in a predictor (e.g., female), the expected change in the outcome is that predictor’s coefficient, holding all other variables in the model constant
  • For example, holding pad800 constant, females report about 18.1 fewer minutes of sedentary activity than males
  • Each coefficient is adjusted for all of the other predictors in the model

Multiple linear regression with an interaction term

summary(svyglm(pad680~female*pad800, design = nhc, na.action = na.omit))

Call:
svyglm(formula = pad680 ~ female * pad800, design = nhc, na.action = na.omit)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   393.25805   13.00529  30.238 1.07e-12 ***
female        -25.72500   10.84002  -2.373 0.035199 *  
pad800         -0.37573    0.07615  -4.934 0.000346 ***
female:pad800   0.12010    0.07645   1.571 0.142160    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 46862.61)

Number of Fisher Scoring iterations: 2

Multiple linear regression with an interaction term continued

glm1 <- (svyglm(pad680~female*pad800, design = nhc, na.action = na.omit))
glm1
Stratified 1 - level Cluster Sampling design (with replacement)
With (30) clusters.
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Call:  svyglm(formula = pad680 ~ female * pad800, design = nhc, na.action = na.omit)

Coefficients:
  (Intercept)         female         pad800  female:pad800  
     393.2581       -25.7250        -0.3757         0.1201  

Degrees of Freedom: 6339 Total (i.e. Null);  12 Residual
  (5593 observations deleted due to missingness)
Null Deviance:      300500000 
Residual Deviance: 297100000    AIC: 86400

Multiple linear regression: getting confidence intervals

confint(glm1)
                     2.5 %      97.5 %
(Intercept)   364.92195664 421.5941442
female        -49.34337506  -2.1066203
pad800         -0.54164749  -0.2098056
female:pad800  -0.04646547   0.2866697

Multiple linear regression: two categorical predictors

summary(svyglm(pad680~factor(female)*factor(dmdhrmaz), design=nhc, 
               na.action = na.omit))

Call:
svyglm(formula = pad680 ~ factor(female) * factor(dmdhrmaz), 
    design = nhc, na.action = na.omit)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        396.866     17.316  22.919 5.64e-10 ***
factor(female)1                    -17.001     16.670  -1.020   0.3318    
factor(dmdhrmaz)2                  -70.221     34.156  -2.056   0.0668 .  
factor(dmdhrmaz)3                   -5.949     49.498  -0.120   0.9067    
factor(female)1:factor(dmdhrmaz)2   57.922     49.112   1.179   0.2655    
factor(female)1:factor(dmdhrmaz)3  -63.635     63.464  -1.003   0.3397    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 25424.33)

Number of Fisher Scoring iterations: 2

AIC and BIC

AIC(glm1, null_has_intercept = TRUE)
       eff.p          AIC     deltabar 
    6.362728 99751.045451     1.590682 
BIC(glm1, maximal = glm1)
        p       BIC      neff 
        4 297062151       NaN 

Predicted margins

ols1 <- (svyglm(pad680~1, design=nhc, na.action = na.omit))
predmarg<-svypredmeans(ols1, ~interaction(female,dmdhrmaz))
predmarg
      mean     SE
0.1 396.87 17.316
1.1 379.86 18.002
0.2 326.64 26.614
1.3 310.28 47.484
1.2 367.57 39.700
0.3 390.92 47.087

Non-parametric tests

Non-parametric tests: Wilcoxon signed rank test

  • Non-parametric version of a matched pairs t-test
wil <- svyranktest(ohq630~ohq620, design = nhc, na = TRUE, test = c("wilcoxon"))
wil

    Design-based KruskalWallis test

data:  ohq630 ~ ohq620
df = 4, Chisq = 1811.4, p-value = 4.026e-12

Non-parametric tests: Median test

  • Non-parametric version of an independent samples t-test
mtest <- svyranktest(pad680~female, design = nhc, na = TRUE, test=("median"))
mtest

    Design-based median test

data:  pad680 ~ female
t = -0.34042, df = 14, p-value = 0.7386
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                 -0.005688636 

Non-parametric tests: Kruskal Wallis test

  • Non-parametric version of an ANOVA
kwtest <- svyranktest(ohq620~dmdmartz, design = nhc, na = TRUE, 
                      test=("KruskalWallis"))
kwtest

    Design-based KruskalWallis test

data:  ohq620 ~ dmdmartz
df = 4, Chisq = 24.281, p-value = 0.007872

Logistic regression

Create a binary variable

  • Creating a binary outcome variable
# tghealth: 1 = teeth/gums rated excellent or very good; 0 = good, fair, or poor
nhanes2021$tghealth = NA
nhanes2021$tghealth[nhanes2021$ohq845 == 1] = 1
nhanes2021$tghealth[nhanes2021$ohq845 == 2] = 1
nhanes2021$tghealth[nhanes2021$ohq845 == 3] = 0
nhanes2021$tghealth[nhanes2021$ohq845 == 4] = 0
nhanes2021$tghealth[nhanes2021$ohq845 == 5] = 0
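
The same recode can be done in one step (an equivalent alternative, not required):

# 1-2 (excellent/very good) -> 1; 3-5 (good/fair/poor) -> 0; anything else stays NA
nhanes2021$tghealth <- ifelse(nhanes2021$ohq845 %in% 1:2, 1,
                              ifelse(nhanes2021$ohq845 %in% 3:5, 0, NA))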

Need to reissue the svydesign command

  • The design object stores its own copy of the data, so new or recoded variables are not available until the design is recreated

nhc1 <- svydesign(id=~sdmvpsu, weights=~wtint2yr,strata=~sdmvstra, nest=TRUE, 
                 survey.lonely.psu = "adjust", data=nhanes2021)

Logistic regression

logit1 <- (svyglm(tghealth~factor(dmdeduc2)+ridageyr, family=quasibinomial, 
                  design=nhc1, na.action = na.omit))
summary(logit1)

Call:
svyglm(formula = tghealth ~ factor(dmdeduc2) + ridageyr, design = nhc1, 
    family = quasibinomial, na.action = na.omit)

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.250701   0.220296  -5.677 0.000205 ***
factor(dmdeduc2)2  0.095245   0.218273   0.436 0.671848    
factor(dmdeduc2)3  0.621578   0.185369   3.353 0.007326 ** 
factor(dmdeduc2)4  0.975127   0.173591   5.617 0.000222 ***
factor(dmdeduc2)5  1.795251   0.193574   9.274 3.16e-06 ***
ridageyr          -0.005801   0.001958  -2.963 0.014219 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 1.147022)

Number of Fisher Scoring iterations: 4

Interpretation

  • Interpreted just the same as an unweighted model
  • For a one-unit change in a predictor (e.g., ridageyr), the expected change in the predicted log odds of the outcome is that predictor’s coefficient, holding all other variables in the model constant
  • Each coefficient is adjusted for all of the other predictors in the model; exponentiating gives odds ratios (see the sketch below)
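
Since the coefficients are on the log-odds scale, one way to obtain odds ratios with confidence intervals (a standard transformation, not part of the output above) is:

# exponentiate the coefficients and confidence limits to get odds ratios
exp(cbind(OR = coef(logit1), confint(logit1)))

For example, exp(0.975) ≈ 2.65 for dmdeduc2 level 4: the odds of rating teeth and gums as very good or excellent are about 2.7 times those of the lowest education group, holding age constant.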

Logistic regression using a subpopulation

subset1 <- subset(nhc1, ridageyr > 20)
logit2 <- (svyglm(tghealth~factor(dmdeduc2)+ridageyr, 
                  family=quasibinomial, design=subset1, na.action = na.omit))
summary(logit2)

Call:
svyglm(formula = tghealth ~ factor(dmdeduc2) + ridageyr, design = subset1, 
    family = quasibinomial, na.action = na.omit)

Survey design:
subset(nhc1, ridageyr > 20)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.285928   0.217597  -5.910 0.000149 ***
factor(dmdeduc2)2  0.074280   0.207288   0.358 0.727538    
factor(dmdeduc2)3  0.601982   0.191304   3.147 0.010391 *  
factor(dmdeduc2)4  0.974348   0.175175   5.562 0.000240 ***
factor(dmdeduc2)5  1.801740   0.191866   9.391 2.82e-06 ***
ridageyr          -0.005143   0.001891  -2.720 0.021576 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 1.001688)

Number of Fisher Scoring iterations: 4

Logistic regression: getting pseudo-R-square values

psrsq(logit2, method = c("Cox-Snell"))
[1] 0.07726501
psrsq(logit2, method = c("Nagelkerke"))
[1] 0.1046439

Try it yourself

  • Use HI_CHOL as the outcome and any of the other variables as predictors
data(nhanes)
design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, 
                    nest=TRUE, data=nhanes)
design
Stratified 1 - level Cluster Sampling design (with replacement)
With (31) clusters.
svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
    nest = TRUE, data = nhanes)

Ordered logistic

ologit1 <- svyolr(factor(dmdeduc2)~factor(female)+factor(dmdborn4)+pad680, 
                  design = nhc, method = c("logistic"))
summary(ologit1)
Call:
svyolr(factor(dmdeduc2) ~ factor(female) + factor(dmdborn4) + 
    pad680, design = nhc, method = c("logistic"))

Coefficients:
                        Value   Std. Error   t value
factor(female)1   0.169926364 0.0472898498  3.593295
factor(dmdborn4)1 0.234984893 0.1373352880  1.711031
pad680            0.001808744 0.0001728276 10.465596

Intercepts:
    Value    Std. Error t value 
1|2  -2.4906   0.2214   -11.2488
2|3  -1.3229   0.1911    -6.9208
3|4   0.3635   0.1564     2.3249
4|5   1.6192   0.1773     9.1339
(4226 observations deleted due to missingness)
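
  • For example, exp(0.170) ≈ 1.19: holding the other predictors constant, the odds of being in a higher education category are about 19% greater for females than for males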

Count model

Poisson model

summary(svyglm(pad800~female, design=nhc, family=poisson()))

Call:
svyglm(formula = pad800 ~ female, design = nhc, family = poisson())

Survey design:
svydesign(id = ~sdmvpsu, weights = ~wtint2yr, strata = ~sdmvstra, 
    nest = TRUE, survey.lonely.psu = "adjust", data = nhanes2021)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.30368    0.02881 149.376  < 2e-16 ***
female      -0.27657    0.02811  -9.839 1.14e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 67.8306)

Number of Fisher Scoring iterations: 5
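
Exponentiated Poisson coefficients can be read as rate ratios; for example, exp(-0.277) ≈ 0.76, i.e., females report about 24% fewer minutes of moderate LTPA than males. A minimal sketch:

# refit the model and exponentiate the coefficients to get rate ratios
pois1 <- svyglm(pad800 ~ female, design = nhc, family = poisson())
exp(coef(pois1))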

Other things

Weighted Principal Components Analysis

pc <- svyprcomp(~pad680+ohq620+ohq630, design=nhc,scale=TRUE,scores=TRUE)
pc
Standard deviations (1, .., p=3):
[1] 1.4538469 0.8541645 0.3958942

Rotation (n x k) = (3 x 3):
              PC1        PC2        PC3
pad680 -0.4433069 -0.8948138 -0.0527938
ohq620 -0.6281802  0.3521469 -0.6938171
ohq630 -0.6394283  0.2744099  0.7182135

PCA plot

biplot(pc, weight="scaled")

Cronbach’s alpha

svycralpha(~dmdhrmaz+dmdborn4, design=nhc, na.rm = TRUE)
   *alpha* 
0.06773242 

Hot off the press!

  • Thomas Lumley’s newest package: svyVGAM (design-based inference for Vector Generalised Linear Models)
    • https://cran.r-project.org/web/packages/svyVGAM/index.html
    • proportional odds model (ordinal logistic regression)
    • different algorithm than used by svyolr()
    • zero-inflated Poisson
    • multinomial logistic regression

Wrapping up

More information on NHANES data

  • Helpful resources
    • Listserv: http://www.cdc.gov/nchs/nhanes/nhanes_listserv.htm
    • Online tutorials: http://www.cdc.gov/nchs/tutorials/index.htm
    • NHANES data are also available from IPUMS

Questions????

[slide asking for questions]

References

  • Applied Survey Data Analysis by Steven G. Heeringa, Brady T. West, and Patricia A. Berglund
  • Analysis of Health Surveys by Edward L. Korn and Barry I. Graubard
  • Sampling of Populations: Methods and Applications, Fourth Edition by Paul Levy and Stanley Lemeshow
  • Analysis of Survey Data Edited by R. L. Chambers and C. J. Skinner

References continued

  • Sampling Techniques, Third Edition by William G. Cochran. 1977.
  • Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., and Rasbash, J. (1998), Weighting for Unequal Selection Probabilities in Multilevel Models, Journal of the Royal Statistical Society, Series B, 60, 23-40.
  • Rabe-Hesketh, S. and Skrondal, A. (2006). Multilevel Modelling of Complex Survey Data, Journal of the Royal Statistical Society, Series A, 169, 805-827.

References continued

  • Carle, Adam C. (2009). Fitting Multilevel Models in Complex Survey Data with Design Weights: Recommendations. BMC Medical Research Methodology; 9(49).
  • Quartagno, M., Carpenter, R., and Goldstein, H. (2019). Multiple Imputation with Survey Weights: A Multilevel Approach. Journal of Survey Statistics and Methodology, Volume 8, Issue 5, November 2020, Pages 965–989, https://doi.org/10.1093/jssam/smz036

Thank you!!

[slide thanking audience for attending]