Applied Survey Data Analysis using SAS 9.4

The purpose of this workshop is to explore some issues in the analysis of survey data using SAS 9.4 and SAS/STAT 15.2. Most of the code shown in this workshop will work in earlier versions of SAS and SAS/STAT. To find out which versions of SAS and SAS/STAT you are running, open SAS and look at the notes at the top of the log.

NOTE: SAS (r) Proprietary Software 9.4 (TS1M7)

NOTE: Updated analytical products:

      SAS/STAT 15.2
      SAS/ETS 15.2
      SAS/OR 15.2
      SAS/IML 15.2
      SAS/QC 15.2

There are seven survey procedures.

proc surveyselect: This procedure can be used to select a sample from a dataset.

proc surveyimpute: This procedure can be used to do single imputations on a survey dataset.

proc surveymeans: This procedure can be used to obtain weighted descriptive statistics for continuous variables. This procedure can produce graphs.

proc surveyfreq: This procedure can be used to run weighted one-way and multi-way crosstabulations. This procedure can produce graphs.

proc surveyreg: This procedure can be used to run weighted OLS regressions.

proc surveylogistic: This procedure can be used to run weighted logistic, ordinal, multinomial and probit regressions.

proc surveyphreg: This procedure can be used to run weighted proportional hazards regression.

We will also briefly discuss proc glimmix.

proc glimmix: This procedure will allow for sampling weights, so it can be used to run weighted multilevel models. This procedure does not have a strata, cluster or a domain statement, and it does not allow for replicate weights. It requires that a sampling weight be specified at each level of the model.

Why do we need survey data analysis software?

Regular statistical software (that is not designed for survey data) analyzes data as if the data were collected using simple random sampling. For experimental and quasi-experimental designs, this is exactly what we want. However, very few surveys use a simple random sample to collect data. Not only is it nearly impossible to do so, but it is not as efficient (either financially or statistically) as other sampling methods. When any sampling method other than simple random sampling is used, we usually need to use survey data analysis software to take into account the differences between the design that was used to collect the data and simple random sampling. This is because the sampling design affects both the calculation of the point estimates and the standard errors of those estimates. If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, both the point estimates and their standard errors will likely be calculated incorrectly. The sampling weight will affect the calculation of the point estimate, and the stratification and/or clustering will affect the calculation of the standard errors. Ignoring the clustering will likely lead to standard errors that are underestimated, possibly leading to results that seem to be statistically significant when, in fact, they are not. The difference in point estimates and standard errors obtained using non-survey software and survey software with the design properly specified will vary from data set to data set, and even between analyses using the same data set. While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be.

Sampling designs

Most people do not conduct their own surveys. Rather, they use survey data that some agency or company collected and made available to the public. The documentation must be read carefully to find out what kind of sampling design was used to collect the data. This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs. Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.

Below are some common features of many sampling designs.

Sampling weights: There are several types of weights that can be associated with a survey. Perhaps the most common is the sampling weight. A sampling weight is a probability weight that has had one or more adjustments made to it. Both a sampling weight and a probability weight are used to weight the sample back to the population from which the sample was drawn. By definition, a probability weight is the inverse of the probability of being included in the sample due to the sampling design (except for a certainty PSU, see below). The probability weight, called a pweight in Stata, is calculated as N/n, where N = the number of elements in the population and n = the number of elements in the sample. For example, if a population has 10 elements and 3 are sampled at random with replacement, then the probability weight would be 10/3 = 3.33. In a two-stage design, the probability weight is calculated as f₁f₂, where f₁ is the inverse of the sampling fraction for the first stage and f₂ is the inverse of the sampling fraction for the second stage. Under many sampling plans, the sum of the probability weights will equal the population total.
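
The two-stage calculation can be sketched in a short data step. The design below is hypothetical and is not taken from any survey discussed in this workshop: 10 of 100 districts are sampled, and then 5 of the 20 schools in each sampled district. All data set and variable names are made up for illustration.

```sas
/* Hypothetical two-stage design. All names and numbers are illustrative. */
data pweight_example;
  pop1  = 100;           /* districts in the population            */
  samp1 = 10;            /* districts sampled at stage 1           */
  pop2  = 20;            /* schools in each sampled district       */
  samp2 = 5;             /* schools sampled per district, stage 2  */
  f1 = pop1 / samp1;     /* inverse of the stage 1 sampling fraction */
  f2 = pop2 / samp2;     /* inverse of the stage 2 sampling fraction */
  pweight = f1 * f2;     /* probability weight for a sampled school  */
  sum_of_weights = samp1 * samp2 * pweight;  /* compare with pop1 * pop2 */
run;

proc print data = pweight_example noobs;
run;
```

In this sketch every sampled school carries the same weight, so the sum of the weights over the 50 sampled schools can be compared directly with the 2,000 schools in the hypothetical population.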

While many textbooks will end their discussion of probability weights here, this definition does not fully describe the sampling weights that are included with actual survey data sets. Rather, the sampling weight, which is sometimes called a “final weight,” starts with the inverse of the sampling fraction, but then incorporates several other values, such as corrections for unit non-response, errors in the sampling frame (sometimes called non-coverage), and poststratification. Because these other values are built into the sampling weight that is included with the data set, it is often inadvisable to modify the sampling weights, such as trying to standardize them for a particular variable, e.g., age.

PSU: This is the primary sampling unit. This is the first unit that is sampled in the design. For example, school districts from California may be sampled and then schools within districts may be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU. One does not need to use the same sampling method at all levels of sampling. For example, probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts). In the case of a simple random sample, the PSUs and the elementary units are the same. In general, accounting for the clustering in the data (i.e., using the PSUs), will increase the standard errors of the point estimates. Conversely, ignoring the PSUs will tend to yield standard errors that are too small, leading to false positives when doing significance tests.

Strata: Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race or SES. Each element in the population must belong to one, and only one, stratum. Once the strata have been defined, samples are taken from each stratum as if it were independent of all of the other strata. For example, if a sample is to be stratified on gender, men and women would be sampled independently of one another. This means that the probability weights for men will likely be different from the probability weights for the women. In most cases, you need to have two or more PSUs in each stratum. The purpose of stratification is to reduce the standard error of the estimates, and stratification works most effectively when the variance of the dependent variable is smaller within the strata than in the sample as a whole.

FPC: This is the finite population correction. This is used when the sampling fraction (the number of elements or respondents sampled relative to the population) becomes large. The FPC is used in the calculation of the standard error of the estimate. If the value of the FPC is close to 1, it will have little impact and can be safely ignored. In some survey data analysis programs, such as SUDAAN, this information will be needed if you specify that the data were collected without replacement (see below for a definition of “without replacement”). The formula for calculating the FPC is ((N-n)/(N-1))^(1/2), that is, the square root of (N-n)/(N-1), where N is the number of elements in the population and n is the number of elements in the sample. To see the impact of the FPC for samples of various proportions, suppose that you had a population of 10,000 elements.

Sample size (n)    FPC
1                  1.0000
10                  .9995
100                 .9950
500                 .9747
1000                .9487
5000                .7071
9000                .3162
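
The table above can be reproduced with a short data step that applies the FPC formula to each sample size. This is only a sketch; the data set name fpc_table and the variable names are arbitrary choices, not part of the workshop's code file.

```sas
/* FPC = sqrt((N - n) / (N - 1)) for a population of 10,000 elements */
data fpc_table;
  pop_size = 10000;
  do n = 1, 10, 100, 500, 1000, 5000, 9000;
    fpc = sqrt((pop_size - n) / (pop_size - 1));
    output;
  end;
  format fpc 6.4;
run;

proc print data = fpc_table noobs;
  var n fpc;
run;
```

Changing pop_size or the list of sample sizes shows how quickly the correction moves away from 1 as the sampling fraction grows.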

Replicate weights: Replicate weights are a series of weight variables that are used to correct the standard errors for the sampling plan. They serve the same function as the PSU and strata variables (which are used in Taylor series linearization): they correct the standard errors of the estimates for the sampling design. Many public use data sets are now being released with replicate weights instead of PSUs and strata in an effort to more securely protect the identity of the respondents. In theory, the same standard errors will be obtained using either the PSU and strata or the replicate weights. There are different ways of creating replicate weights; the method used is determined by the sampling plan. The most common are balanced repeated replication (BRR) and jackknife replicate weights. You will need to read the documentation for the survey data set carefully to learn what type of replicate weight is included in the data set; specifying the wrong type of replicate weight will likely lead to incorrect standard errors. For more information on replicate weights, please see Stata Library: Replicate Weights and Appendix D of the WesVar Manual by Westat, Inc. Several statistical packages, including Stata, SAS, SUDAAN, WesVar and R, allow the use of replicate weights.
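
In the SAS survey procedures, replicate weights are supplied with the repweights statement, and the replication method is chosen with the varmethod= option on the proc statement. The sketch below assumes a hypothetical data set mysurvey with a full-sample weight finalwt, 80 jackknife replicate weights named repwt1 to repwt80, and an analysis variable age. None of these names come from the NHANES data used in this workshop.

```sas
/* Hypothetical data set and variable names. Check the survey documentation
   for the replication method and for any jackknife coefficients. */
proc surveymeans data = mysurvey varmethod = jackknife;
  weight finalwt;                 /* full-sample weight              */
  repweights repwt1-repwt80;      /* replicate weights from the file */
  var age;
run;
```

For BRR weights, varmethod = brr would be used instead. Some surveys also publish jackknife coefficients, which can be passed through the jkcoefs= option of the repweights statement; whether that is needed depends on the survey documentation.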

Consequences of not using the design elements

Sampling design elements include the sampling weights, post-stratification weights (if provided), PSUs, strata, and replicate weights. Rarely are all of these elements included in a single public-use data set. However, ignoring the design elements that are included can often lead to inaccurate point estimates and/or inaccurate standard errors.

Sampling with and without replacement

Most samples collected in the real world are collected “without replacement”. This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample. Many of the calculations change depending on whether a sample is collected with or without replacement. Hence, programs like SUDAAN request that you specify whether a survey sampling design was implemented with or without replacement, and an FPC is used if sampling without replacement is used, even if the value of the FPC is very close to one.
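
In the SAS survey procedures, one way to request a finite population correction is the total= option (or, alternatively, the rate= option) on the proc statement. The sketch below assumes a hypothetical simple random sample named mysample, drawn without replacement from a population of 10,000 elements, with a weight variable pw and an analysis variable y. These names and the population size are illustrative only.

```sas
/* Hypothetical example. total= gives the number of sampling units in the
   population, which the procedure uses to apply the FPC. */
proc surveymeans data = mysample total = 10000;
  weight pw;
  var y;
run;
```

When neither total= nor rate= is given, no finite population correction is applied to the variance estimates.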

Examples

For the examples in this workshop, we will use the data set from NHANES 2011-2012. The data set and documentation can be downloaded from the NHANES web site. The data files can be downloaded as SAS transport (.xpt) files.
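
A downloaded transport file can be read with the xport engine on a libname statement. The sketch below is an assumption about how one file might be read: the folder is a placeholder, and DEMO_G.XPT is assumed to be the name of the 2011-2012 demographics file. The steps used to build the nhanes2012 data set used in the examples are not shown here.

```sas
/* The folder and file name are placeholders. Adjust them to your download. */
libname xptin xport "C:\mydata\DEMO_G.XPT";

/* Copy every data set stored in the transport file into the WORK library. */
proc copy in = xptin out = work memtype = data;
run;

libname xptin clear;
```

Each NHANES component is distributed as a separate file, so several files would typically be read this way and then merged by the respondent identifier before analysis.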

Reading the documentation

The first step in analyzing any survey data set is to read the documentation. With many of the public use data sets, the documentation can be quite extensive and sometimes even intimidating. Instead of trying to read the documentation “cover to cover”, there are some parts you will want to focus on. First, read the Introduction. This is usually an “easy read” and will orient you to the survey. There is usually a section or chapter called something like “Sample Design and Analysis Guidelines”, “Variance Estimation”, etc. This is the part that tells you about the design elements included with the survey and how to use them. Some even give example code. If multiple sampling weights have been included in the data set, there will be some instruction about when to use which one. If there is a section or chapter on missing data or imputation, please read that. This will tell you how missing data were handled. You should also read any documentation regarding the specific variables that you intend to use. As we will see a little later on, we will need to look at the documentation to get the value labels for the variables. This is especially important because some of the values are actually missing data codes, and you need to do something so that SAS doesn’t treat those as valid values (or you will get some very “interesting” means, totals, etc.).

The variables

We will use about a dozen different variables in the examples in this workshop. Below is a brief summary of them. Some of the variables have been recoded to be binary variables (values of 2 recoded to a value of 0). The count of missing observations includes values truly missing as well as refused and don’t know.

ridageyr – Age in years at exam – recoded; range of values: 0 – 79 are actual values, 80 = 80+ years of age

pad630 – How much time do you spend doing moderate-intensity activities on a typical work day?; range of values: 10-960 (minutes); 7053 missing observations

hsq496 – During the past 30 days, for about how many days have you felt worried, tense or anxious?; range of values: 0-30; 3073 missing observations

female – Recode of the variable riagendr; 0 = male, 1 = female; no missing observations

dmdborn4 – Country of birth; 1 = born in the United States, 0 = otherwise; 5 missing observations

dmdmartl – Marital status; 1 = married, 2 = widowed, 3 = divorced, 4 = separated, 5 = never married, 6 = living with partner; 4203 missing observations

dmdeduc2 – Education level of adults aged 20+ years; 1 = less than 9th grade, 2 = 9-11th grade, 3 = high school graduate, GED or equivalent, 4 = some college or AA degree, 5 = college graduate or above; 4201 missing observations

pad675 – How much time do you spend doing moderate-intensity sports, fitness, or recreation activities on a typical day?; range of values: 10-600 (minutes); 6220 missing observations

hsq571 – During the past 12 months, have you donated blood?; 0 = no, 1 = yes; 3673 missing observations

pad680 – How much time do you usually spend sitting on a typical day?; range of values: 0-1380 (minutes); 2365 missing observations

paq665 – Do you do any moderate-intensity sports, fitness or recreational activities that cause a small increase in breathing or heart rate at least 10 minutes continually?; 0 = no, 1 = yes; 2329 missing observations

hsd010 – Would you say that your general health is…; 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor; 3064 missing observations

hsq470 – number of days in the last 30 days that physical health is not good; range of values: 0 – 30 (days), 3075 missing observations

hsq480 – number of days in the last 30 days that mental health is not good; range of values: 0 – 30 (days), 3073 missing observations
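
The recoding described above (values of 2 recoded to 0, and refused and don’t know treated as missing) might look something like the following data step. This is a sketch, not the code used to prepare the workshop data: the input data set name nhanes_raw is hypothetical, and the raw codes (2 = female for riagendr; 2 = no, 7 = refused, 9 = don’t know for hsq571) are assumptions that should be confirmed in the NHANES codebook.

```sas
/* Sketch only. Input data set name and raw codes are assumptions. */
data nhanes2012;
  set nhanes_raw;
  /* binary indicator: 1 = female, 0 = male */
  female = (riagendr = 2);
  /* set refused and dont-know codes to missing, then recode 2 (no) to 0 */
  if hsq571 in (7, 9) then hsq571 = .;
  else if hsq571 = 2 then hsq571 = 0;
run;
```

Values that are already missing pass through unchanged, so after this step they are counted together with the refused and don’t know responses.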

There are three other variables that we should identify. One is the sampling weight variable. It is wtint2yr. The cluster variable is sdmvpsu and the stratification variable is sdmvstra. There are 14 strata and 31 clusters in this dataset. Let’s briefly look at each of these variables.

proc means data = nhanes2012 n min mean max sum;
var wtint2yr;
run;
The MEANS Procedure

                    Analysis Variable : WTINT2YR

   N         Minimum            Mean         Maximum             Sum
--------------------------------------------------------------------
9756         3320.89        31425.86       220233.32       306590681
--------------------------------------------------------------------

We see that there are 9756 observations in the dataset. The average weight is 31425.86, with a minimum of 3320.89 and a maximum of 220233.32. What does this mean? Each row of data in this dataset has a value for the sampling weight, and the person who contributed that row of data represents that many people in the population. What is “the population”? Quoting from the NHANES documentation (NHANES 2011-2012 Overview): The NHANES target population is the noninstitutionalized civilian resident population of the United States. The sum of the weights, 306,590,681, is the estimated number of people in that population. If you look at other sources for the population of the United States in 2012, you will see something like 314.1 million; that larger figure is not restricted to the noninstitutionalized civilian population, so the two numbers are not expected to match.

Now let’s look at the cluster and strata variables.


proc freq data = nhanes2012;
tables sdmvpsu sdmvstra;
run;
The FREQ Procedure

                                    Cumulative    Cumulative
SDMVPSU    Frequency     Percent     Frequency      Percent
------------------------------------------------------------
      1        4374       44.83          4374        44.83
      2        4490       46.02          8864        90.86
      3         892        9.14          9756       100.00


                                     Cumulative    Cumulative
SDMVSTRA    Frequency     Percent     Frequency      Percent
-------------------------------------------------------------
      90         862        8.84           862         8.84
      91         998       10.23          1860        19.07
      92         875        8.97          2735        28.03
      93         602        6.17          3337        34.20
      94         688        7.05          4025        41.26
      95         722        7.40          4747        48.66
      96         676        6.93          5423        55.59
      97         608        6.23          6031        61.82
      98         708        7.26          6739        69.08
      99         682        6.99          7421        76.07
     100         700        7.18          8121        83.24
     101         715        7.33          8836        90.57
     102         624        6.40          9460        96.97
     103         296        3.03          9756       100.00


proc freq data = nhanes2012;
tables sdmvpsu*sdmvstra;
run;
The FREQ Procedure

Table of SDMVPSU by SDMVSTRA

SDMVPSU     SDMVSTRA

Frequency|
Percent  |
Row Pct  |
Col Pct  |      90|      91|      92|      93|      94|      95|      96|  Total
---------+--------+--------+--------+--------+--------+--------+--------+
       1 |    278 |    309 |    328 |    276 |    322 |    348 |    336 |   4374
         |   2.85 |   3.17 |   3.36 |   2.83 |   3.30 |   3.57 |   3.44 |  44.83
         |   6.36 |   7.06 |   7.50 |   6.31 |   7.36 |   7.96 |   7.68 |
         |  32.25 |  30.96 |  37.49 |  45.85 |  46.80 |  48.20 |  49.70 |
---------+--------+--------+--------+--------+--------+--------+--------+
       2 |    351 |    333 |    244 |    326 |    366 |    374 |    340 |   4490
         |   3.60 |   3.41 |   2.50 |   3.34 |   3.75 |   3.83 |   3.49 |  46.02
         |   7.82 |   7.42 |   5.43 |   7.26 |   8.15 |   8.33 |   7.57 |
         |  40.72 |  33.37 |  27.89 |  54.15 |  53.20 |  51.80 |  50.30 |
---------+--------+--------+--------+--------+--------+--------+--------+
       3 |    233 |    356 |    303 |      0 |      0 |      0 |      0 |    892
         |   2.39 |   3.65 |   3.11 |   0.00 |   0.00 |   0.00 |   0.00 |   9.14
         |  26.12 |  39.91 |  33.97 |   0.00 |   0.00 |   0.00 |   0.00 |
         |  27.03 |  35.67 |  34.63 |   0.00 |   0.00 |   0.00 |   0.00 |
---------+--------+--------+--------+--------+--------+--------+--------+
Total         862      998      875      602      688      722      676     9756
             8.84    10.23     8.97     6.17     7.05     7.40     6.93   100.00
(Continued)
Frequency|
Percent  |
Row Pct  |
Col Pct  |      97|      98|      99|     100|     101|     102|     103|  Total
---------+--------+--------+--------+--------+--------+--------+--------+
       1 |    316 |    388 |    362 |    343 |    358 |    270 |    140 |   4374
         |   3.24 |   3.98 |   3.71 |   3.52 |   3.67 |   2.77 |   1.44 |  44.83
         |   7.22 |   8.87 |   8.28 |   7.84 |   8.18 |   6.17 |   3.20 |
         |  51.97 |  54.80 |  53.08 |  49.00 |  50.07 |  43.27 |  47.30 |
---------+--------+--------+--------+--------+--------+--------+--------+
       2 |    292 |    320 |    320 |    357 |    357 |    354 |    156 |   4490
         |   2.99 |   3.28 |   3.28 |   3.66 |   3.66 |   3.63 |   1.60 |  46.02
         |   6.50 |   7.13 |   7.13 |   7.95 |   7.95 |   7.88 |   3.47 |
         |  48.03 |  45.20 |  46.92 |  51.00 |  49.93 |  56.73 |  52.70 |
---------+--------+--------+--------+--------+--------+--------+--------+
       3 |      0 |      0 |      0 |      0 |      0 |      0 |      0 |    892
         |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   9.14
         |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |
         |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |
---------+--------+--------+--------+--------+--------+--------+--------+
Total         608      708      682      700      715      624      296     9756
             6.23     7.26     6.99     7.18     7.33     6.40     3.03   100.00

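This tells us that there are two or three clusters (AKA PSUs) per stratum: strata 90, 91 and 92 have three clusters each, and the remaining 11 strata have two each, for a total of 31. Having only a few PSUs per stratum is pretty typical for a survey dataset. The numbering of the clusters and strata does not matter in most statistical software packages.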
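
Reading the number of PSUs per stratum off a large crosstabulation is error-prone. One alternative, sketched here with proc sql, is to count the distinct PSU values within each stratum directly. The column name n_psu is an arbitrary choice.

```sas
/* count the distinct PSUs within each stratum */
proc sql;
  select sdmvstra,
         count(distinct sdmvpsu) as n_psu
  from nhanes2012
  group by sdmvstra;
quit;
```

Summing n_psu over the strata gives the total number of clusters that the survey procedures report in their Data Summary table.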

Descriptive statistics

We will start by calculating some descriptive statistics of some of the continuous variables. We will use proc surveymeans to get some basic information regarding the continuous variable ridageyr.

* descriptives with a continuous variable;
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
RIDAGEYR            9756       37.185195        0.696477    35.7157572 38.6546320
---------------------------------------------------------------------------------

We see some familiar numbers in this output. We see the 14 strata, 31 clusters, 9756 observations, and the estimated population total of 306,590,681.

There are many options that you can use. The options are usually included on the proc statement. The range option gives the range, which is the maximum minus the minimum.

* with some options;
proc surveymeans data = nhanes2012 min mean max range;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                       Statistics

                                                                               Std Error
Variable         Minimum         Maximum           Range            Mean         of Mean
----------------------------------------------------------------------------------------
RIDAGEYR               0       80.000000       80.000000       37.185195        0.696477
----------------------------------------------------------------------------------------

Notice that the output includes only the statistics requested on the proc surveymeans statement.

proc surveymeans data = nhanes2012 quartiles;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                    Quantiles

Variable       Percentile        Estimate       Std Error    95% Confidence Limits
----------------------------------------------------------------------------------
RIDAGEYR          25% Q1        17.514078        0.580720    16.2888667 18.7392897
                  50% Median    36.205625        1.394588    33.2633021 39.1479485
                  75% Q3        54.651947        0.925135    52.7000816 56.6038114
----------------------------------------------------------------------------------

* other options include deciles, quartiles, median, q1, q3, and specific values;
proc surveymeans data = nhanes2012 percentile = (10 25 50 75 90);
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                    Quantiles

Variable       Percentile        Estimate       Std Error    95% Confidence Limits
----------------------------------------------------------------------------------
RIDAGEYR          10% D1         6.557412        0.318603     5.8852178  7.2296057
                  25% Q1        17.514078        0.580720    16.2888667 18.7392897
                  50% Median    36.205625        1.394588    33.2633021 39.1479485
                  75% Q3        54.651947        0.925135    52.7000816 56.6038114
                  90% D9        67.552258        1.029785    65.3796014 69.7249153
----------------------------------------------------------------------------------

In the example below, five options are specified. The nmiss option shows the number of missing values for the variable pad630 (How much time do you spend doing moderate-intensity activities on a typical work day?). The df option shows the degrees of freedom used. The degrees of freedom are equal to the number of clusters (PSUs) minus the number of strata. In this example, 31 – 14 = 17. The cv option gives the coefficient of variation of the estimated mean, which is the standard error of the mean divided by the mean. The geomean option gives the geometric mean, which is the nth root of the product of n numbers. It is sometimes used when combining items that have different ranges. The gmstderr option gives the standard error of the geometric mean.

* using some options;
proc surveymeans data = nhanes2012 nmiss df cv geomean gmstderr;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var pad630;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                    Statistics

                                          Coeff of
Variable          N Miss        DF       Variation
--------------------------------------------------
PAD630              7702        17        0.039883
--------------------------------------------------

            Geometric Means

               Geometric
Variable            Mean       Std Error
----------------------------------------
PAD630         90.048920        3.648271
----------------------------------------

Notice that SAS does not do a listwise deletion of missing values across all of the variables listed on the var statement. (Notice that the N is different for each of the three variables listed in the output.)

* does not do a listwise deletion across multiple variables;
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr pad630 hsq496;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
RIDAGEYR            9756       37.185195        0.696477     35.715757  38.654632
PAD630              2054      139.887377        5.579060    128.116590 151.658164
HSQ496              5883        5.383908        0.189951      4.983147   5.784669
---------------------------------------------------------------------------------
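
As a quick arithmetic check on two of the statistics requested earlier for pad630, the data step below recomputes the coefficient of variation as the standard error of the mean divided by the mean, and the degrees of freedom as the number of clusters minus the number of strata. The inputs are copied from the output above. This is only a sketch for checking the numbers by hand; proc surveymeans does not require it.

```sas
/* inputs copied from the proc surveymeans output for pad630 */
data _null_;
  mean   = 139.887377;
  stderr = 5.579060;
  cv = stderr / mean;      /* coefficient of variation of the mean */
  df = 31 - 14;            /* clusters minus strata                */
  put cv= 8.6 df=;
run;
```

The values written to the log should be close to the coefficient of variation and the degrees of freedom that proc surveymeans reported for pad630.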

The ODS graphics must be turned on for SAS to produce the graphs. If you do not submit the ods graphics on; statement, SAS will give you all of the output from proc surveymeans except for the graphs. There will be a warning in the log file indicating that ODS graphics must be turned on in order to get the graphs.

* getting some graphs;
ods graphics on;
proc surveymeans data = nhanes2012 plots = all;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;
ods graphics off;

<output omitted>

In the box plot, the diamond is the mean (37.19), and the line is the median (36.21).

* getting one graph at a time;
ods graphics on;
proc surveymeans data = nhanes2012 plots = boxplot;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;
ods graphics off;



* getting just the histogram;
ods graphics on;
proc surveymeans data = nhanes2012 plots = histogram;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;
ods graphics off;

There are a few different ways to get descriptive statistics with categorical variables. You can use proc surveymeans if your variable is binary (i.e., coded 0/1).

* descriptives with a binary variable;
* this is actually a proportion;
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var female;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
female              9756        0.511952        0.006440    0.49836568 0.52553917
---------------------------------------------------------------------------------

Probably the most common procedure for getting descriptive statistics for categorical variables is proc surveyfreq. The tables statement in proc surveyfreq works the same way that the tables statement in proc freq works.

* this might be more common;
proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables female;
run;

The SURVEYFREQ Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                             Table of female

                          Weighted    Std Dev of               Std Err of
female     Frequency     Frequency      Wgt Freq    Percent       Percent
-------------------------------------------------------------------------
     0          4856     149630839       8783128    48.8048        0.6440
     1          4900     156959842      11234711    51.1952        0.6440

 Total          9756     306590681      19723273    100.000
-------------------------------------------------------------------------

You may want to use formats to help label the output.
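
The formats fm. and cb. used in the next example must be defined before the procedure runs. Their definitions are not shown in this section; a minimal sketch, with the value labels assumed from the output that follows, would be:

```sas
/* value labels are assumed from the output shown in this workshop */
proc format;
  value fm 0 = "male"
           1 = "female";
  value cb 0 = "born elsewhere"
           1 = "born in US";
run;
```

Formats created this way are stored in the WORK catalog by default, so they need to be re-created in each new SAS session unless they are saved to a permanent library.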

* using formats;
proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables female*dmdborn4;
format female fm. dmdborn4 cb.;
run;

The SURVEYFREQ Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                Table of female by DMDBORN4

                                            Weighted    Std Dev of               Std Err of
female    DMDBORN4           Frequency     Frequency      Wgt Freq    Percent       Percent
-------------------------------------------------------------------------------------------
male      born elsewhere          1027      22449131       2368069     7.3247        0.8655
          born in US              3825     127102676       8587374    41.4710        0.6134

          Total                   4852     149551807       8773197    48.7957        0.6428
-------------------------------------------------------------------------------------------
female    born elsewhere          1056      23543299       1926175     7.6817        0.7451
          born in US              3843     133390830      11273589    43.5227        1.2458

          Total                   4899     156934129      11235137    51.2043        0.6428
-------------------------------------------------------------------------------------------
Total     born elsewhere          2083      45992430       4177655    15.0064        1.5756
          born in US              7668     260493506      19670647    84.9936        1.5756

          Total                   9751     306485936      19715992    100.000
-------------------------------------------------------------------------------------------
                                   Frequency Missing = 5

In the next example, several options are used. The expected option gives the expected weighted frequency for each cell in the table. The row option gives the row percentages, and the col option gives the column percentages. The remaining four options request tests of association: the chisq option gives the Rao-Scott chi-square test; the lrchisq option gives the Rao-Scott likelihood ratio chi-square test; the wchisq option gives the Wald chi-square test; and the wllchisq option gives the Wald log-linear chi-square test.

proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables female*dmdborn4 / expected row col chisq lrchisq wchisq wllchisq;
format female fm. dmdborn4 cb.;
run;

The SURVEYFREQ Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                   Table of female by DMDBORN4

                                         Weighted   Std Dev of     Expected             Std Err of
female   DMDBORN4          Frequency    Frequency     Wgt Freq     Wgt Freq   Percent      Percent
--------------------------------------------------------------------------------------------------
male     born elsewhere         1027     22449131      2368069     22442305    7.3247       0.8655
         born in US             3825    127102676      8587374    127109501   41.4710       0.6134

         Total                  4852    149551807      8773197                48.7957       0.6428
--------------------------------------------------------------------------------------------------
female   born elsewhere         1056     23543299      1926175     23550124    7.6817       0.7451
         born in US             3843    133390830     11273589    133384005   43.5227       1.2458

         Total                  4899    156934129     11235137                51.2043       0.6428
--------------------------------------------------------------------------------------------------
Total    born elsewhere         2083     45992430      4177655                15.0064       1.5756
         born in US             7668    260493506     19670647                84.9936       1.5756

         Total                  9751    306485936     19715992                100.000
--------------------------------------------------------------------------------------------------
                                      Frequency Missing = 5

                       Table of female by DMDBORN4

                              Row     Std Err of     Column     Std Err of
female   DMDBORN4         Percent    Row Percent    Percent    Col Percent
--------------------------------------------------------------------------
male     born elsewhere   15.0109         1.6401    48.8105         1.2315
         born in US       84.9891         1.6401    48.7930         0.6756

         Total            100.000
--------------------------------------------------------------------------
female   born elsewhere   15.0020         1.5769    51.1895         1.2315
         born in US       84.9980         1.5769    51.2070         0.6756

         Total            100.000
--------------------------------------------------------------------------
Total    born elsewhere                             100.000
         born in US                                 100.000

         Total
--------------------------------------------------------------------------
                          Frequency Missing = 5

Table of female by DMDBORN4

 Rao-Scott Chi-Square Test

Pearson Chi-Square    0.0002
Design Correction     0.7893

Rao-Scott Chi-Square  0.0002
DF                         1
Pr > ChiSq            0.9889

F Value               0.0002
Num DF                     1
Den DF                    17
Pr > F                0.9891

     Sample Size = 9751


  Rao-Scott Likelihood Ratio Test

Likelihood Ratio Chi-Square  0.0002
Design Correction            0.7893

Rao-Scott Chi-Square         0.0002
DF                                1
Pr > ChiSq                   0.9889

F Value                      0.0002
Num DF                            1
Den DF                           17
Pr > F                       0.9891

        Sample Size = 9751


 Wald Chi-Square Test

Chi-Square      0.0002

F Value         0.0002
Num DF               1
Den DF              17
Pr > F          0.9891

  Sample Size = 9751

Table of female by DMDBORN4

Wald Log-Linear Chi-Square Test

Chi-Square      0.0002

F Value         0.0002
Num DF               1
Den DF              17
Pr > F          0.9891

  Sample Size = 9751

In the next example, we will use three options.

The cv option displays coefficients of variation for percentages. A “coefficient of variation” is the standard deviation divided by the mean or, in our case, the standard error divided by the point estimate. For example, for the percentage of males born elsewhere, .8655/7.3247 = .1182. The cvwt option displays coefficients of variation for weighted frequencies. For example, for males born elsewhere, 2368069/22449131 = .1055.

The deff option displays design effects for percentages. The design effect attempts to quantify the extent to which the observed sampling error differs from what would be expected if simple random sampling (SRS) had been used. It is defined as variance(observed) / variance(SRS), and it can be thought of as a measure of efficiency. If the design effect is 1, then the current analysis with the current sampling plan is as efficient as the same analysis using an SRS. If the design effect is less than 1, then the current analysis with the current sample is more efficient than the same analysis with an SRS. If the design effect is greater than 1, then the current analysis with the current sample is less efficient than the same analysis with an SRS. In general, clustering increases the design effect.

The design effect is related to the idea of an “effective sample size”, which is the number of observations divided by the design effect. For example, for males born elsewhere, 1027/10.7602 = 95.44; for the total born elsewhere, 2083/18.9775 = 109.76.

proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables female*dmdborn4 / cv cvwt deff;
format female fm. dmdborn4 cb.;
run;

The SURVEYFREQ Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                   Table of female by DMDBORN4

                                         Weighted   Std Dev of      CV for             Std Err of
female   DMDBORN4          Frequency    Frequency     Wgt Freq    Wgt Freq   Percent      Percent
-------------------------------------------------------------------------------------------------
male     born elsewhere         1027     22449131      2368069      0.1055    7.3247       0.8655
         born in US             3825    127102676      8587374      0.0676   41.4710       0.6134

         Total                  4852    149551807      8773197      0.0587   48.7957       0.6428
-------------------------------------------------------------------------------------------------
female   born elsewhere         1056     23543299      1926175      0.0818    7.6817       0.7451
         born in US             3843    133390830     11273589      0.0845   43.5227       1.2458

         Total                  4899    156934129     11235137      0.0716   51.2043       0.6428
-------------------------------------------------------------------------------------------------
Total    born elsewhere         2083     45992430      4177655      0.0908   15.0064       1.5756
         born in US             7668    260493506     19670647      0.0755   84.9936       1.5756

         Total                  9751    306485936     19715992      0.0643   100.000
-------------------------------------------------------------------------------------------------
                                      Frequency Missing = 5

          Table of female by DMDBORN4

                             CV for      Design
female   DMDBORN4           Percent      Effect
-----------------------------------------------
male     born elsewhere      0.1182     10.7602
         born in US          0.0148      1.5114

         Total               0.0132      1.6126
-----------------------------------------------
female   born elsewhere      0.0970      7.6319
         born in US          0.0286      6.1566

         Total               0.0126      1.6126
-----------------------------------------------
Total    born elsewhere      0.1050     18.9775
         born in US          0.0185     18.9775

         Total
-----------------------------------------------
             Frequency Missing = 5
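
The hand calculations described before this example can be checked with a short data step. This is only a sketch: the numbers are typed in from the output above, the data set name cvcheck and the variable names are made up for illustration, and the formula used for the SRS variance of a percentage, pct*(100 - pct)/(n - 1) with no finite population correction, is an assumption about how the design effect is computed.

```sas
* checking the cv, cvwt and deff columns by hand;
data cvcheck;
  infile datalines dlm = ',';
  length cell $ 24;
  input cell $ n wtfreq sdwtfreq pct sepct deff;
  ntotal = 9751;                * sample size of the table;
  cv_pct = sepct / pct;         * cv option;
  cv_wt = sdwtfreq / wtfreq;    * cvwt option;
  * assumed SRS variance of a percentage is pct*(100 - pct)/(n - 1);
  deff_chk = sepct**2 / (pct * (100 - pct) / (ntotal - 1));
  n_eff = n / deff;             * effective sample size as computed in the text;
  datalines;
male born elsewhere,1027,22449131,2368069,7.3247,0.8655,10.7602
total born elsewhere,2083,45992430,4177655,15.0064,1.5756,18.9775
;
run;

proc print data = cvcheck noobs;
  var cell cv_pct cv_wt deff deff_chk n_eff;
  format cv_pct cv_wt 6.4 deff_chk 8.4 n_eff 8.2;
run;
```

The cv_pct and cv_wt columns should match the CV columns in the output, and deff_chk should come out close to the Design Effect column; small differences are expected because the typed-in values are rounded.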

Let’s look at some graphs. Applying a format puts the value labels on the x-axis of the graph.

ods graphics on;
proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables dmdmartl / plots = wtfreqplot;
format dmdmartl matsat.;
run;
ods graphics off;

The SURVEYFREQ Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                  Table of DMDMARTL

                                       Weighted    Std Dev of               Std Err of
DMDMARTL                Frequency     Frequency      Wgt Freq    Percent       Percent
--------------------------------------------------------------------------------------
married                      2683     118822198      10556102    53.0792        2.0495
widowed                       467      12586462       1087437     5.6225        0.3200
divorced                      571      23926362       2606988    10.6882        0.7248
separated                     204       5366932        614868     2.3975        0.3161
never married                1188      44479637       4687152    19.8695        2.3629
living with partner           440      18676762       1874572     8.3431        0.6473

Total                        5553     223858353      14237911    100.000
--------------------------------------------------------------------------------------
                               Frequency Missing = 4203

* choose more interesting variables;
ods graphics on;
proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables female*dmdborn4 / plots = mosaicplot;
format female fm. dmdborn4 cb.;
run;
ods graphics off;

The SURVEYFREQ Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                Table of female by DMDBORN4

                                            Weighted    Std Dev of               Std Err of
female    DMDBORN4           Frequency     Frequency      Wgt Freq    Percent       Percent
-------------------------------------------------------------------------------------------
male      born elsewhere          1027      22449131       2368069     7.3247        0.8655
          born in US              3825     127102676       8587374    41.4710        0.6134

          Total                   4852     149551807       8773197    48.7957        0.6428
-------------------------------------------------------------------------------------------
female    born elsewhere          1056      23543299       1926175     7.6817        0.7451
          born in US              3843     133390830      11273589    43.5227        1.2458

          Total                   4899     156934129      11235137    51.2043        0.6428
-------------------------------------------------------------------------------------------
Total     born elsewhere          2083      45992430       4177655    15.0064        1.5756
          born in US              7668     260493506      19670647    84.9936        1.5756

          Total                   9751     306485936      19715992    100.000
-------------------------------------------------------------------------------------------
                                   Frequency Missing = 5

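If you request more than one plot, you need to enclose the plot requests in parentheses. The or option gives the odds ratio and the relative risks, and the risk option gives the risks and the risk difference. Because three variables are listed on the tables statement, a separate table of female by dmdborn4 is produced for each level of dmdmartl.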

ods graphics on;
proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables dmdmartl*female*dmdborn4 / risk or plots =(oddsratioplot relriskplot);
format dmdmartl matsat. female fm. dmdborn4 cb.;
run;
ods graphics off;

The SURVEYFREQ Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                Table of female by DMDBORN4
                             Controlling for DMDMARTL=married

                                            Weighted    Std Dev of               Std Err of
female    DMDBORN4           Frequency     Frequency      Wgt Freq    Percent       Percent
-------------------------------------------------------------------------------------------
male      born elsewhere           554      11803069       1179456     9.9367        1.1615
          born in US               874      46849089       4983912    39.4410        1.0838

          Total                   1428      58652159       5189428    49.3777        0.5675
-------------------------------------------------------------------------------------------
female    born elsewhere           476      11167772       1125718     9.4019        1.0815
          born in US               777      48962647       5297262    41.2204        1.3449

          Total                   1253      60130420       5451479    50.6223        0.5675
-------------------------------------------------------------------------------------------
Total     born elsewhere          1030      22970842       2253478    19.3386        2.2056
          born in US              1651      95811737      10207366    80.6614        2.2056

          Total                   2681     118782579      10555944    100.000
-------------------------------------------------------------------------------------------


                Column 1 Risk Estimates

                        Standard
                Risk       Error   95% Confidence Limits

Row 1         0.2012      0.0228       0.1532     0.2492
Row 2         0.1857      0.0220       0.1393     0.2321
Total         0.1934      0.0221       0.1469     0.2399

Difference    0.0155      0.0078      -0.0010     0.0320

             Difference is (Row 1 - Row 2)

                   Sample Size = 5549

                Column 2 Risk Estimates

                        Standard
                Risk       Error   95% Confidence Limits

Row 1         0.7988      0.0228       0.7508     0.8468
Row 2         0.8143      0.0220       0.7679     0.8607
Total         0.8066      0.0221       0.7601     0.8531

Difference   -0.0155      0.0078      -0.0320     0.0010

             Difference is (Row 1 - Row 2)

                   Sample Size = 5549


           Odds Ratio and Relative Risks (Row1/Row2)

                            Estimate       95% Confidence Limits

Odds Ratio                    1.1046        0.9937        1.2278
Column 1 Relative Risk        1.0835        0.9944        1.1806
Column 2 Relative Risk        0.9809        0.9609        1.0014

                       Sample Size = 5549


                                Table of female by DMDBORN4
                             Controlling for DMDMARTL=widowed

                                            Weighted    Std Dev of               Std Err of
female    DMDBORN4           Frequency     Frequency      Wgt Freq    Percent       Percent
-------------------------------------------------------------------------------------------
male      born elsewhere            23        299145         63828     2.3767        0.5554
          born in US                99       2433302        326571    19.3327        2.1606

          Total                    122       2732447        337748    21.7094        2.3135
-------------------------------------------------------------------------------------------
female    born elsewhere            82       1350041        202386    10.7261        1.8146
          born in US               263       8503975        963871    67.5645        3.0838

          Total                    345       9854015        951308    78.2906        2.3135
-------------------------------------------------------------------------------------------
Total     born elsewhere           105       1649186        210849    13.1029        1.9883
          born in US               362      10937276       1100321    86.8971        1.9883

          Total                    467      12586462       1087437    100.000
-------------------------------------------------------------------------------------------

                Column 1 Risk Estimates

                        Standard
                Risk       Error   95% Confidence Limits

Row 1         0.1095      0.0237       0.0595     0.1594
Row 2         0.1370      0.0239       0.0865     0.1875
Total         0.1310      0.0199       0.0891     0.1730

Difference   -0.0275      0.0316      -0.0941     0.0391

             Difference is (Row 1 - Row 2)

                   Sample Size = 5549


                Column 2 Risk Estimates

                        Standard
                Risk       Error   95% Confidence Limits

Row 1         0.8905      0.0237       0.8406     0.9405
Row 2         0.8630      0.0239       0.8125     0.9135
Total         0.8690      0.0199       0.8270     0.9109

Difference    0.0275      0.0316      -0.0391     0.0941

             Difference is (Row 1 - Row 2)

                   Sample Size = 5549


           Odds Ratio and Relative Risks (Row1/Row2)

                            Estimate       95% Confidence Limits

Odds Ratio                    0.7744        0.4141        1.4482
Column 1 Relative Risk        0.7991        0.4607        1.3860
Column 2 Relative Risk        1.0319        0.9564        1.1134

                       Sample Size = 5549

                                Table of female by DMDBORN4
                             Controlling for DMDMARTL=divorced

                                            Weighted    Std Dev of               Std Err of
female    DMDBORN4           Frequency     Frequency      Wgt Freq    Percent       Percent
-------------------------------------------------------------------------------------------
male      born elsewhere            32        646692        190158     2.7028        0.8131
          born in US               205       8640920       1381509    36.1146        3.6127

          Total                    237       9287612       1406845    38.8175        3.6318
-------------------------------------------------------------------------------------------
female    born elsewhere            70       1676180        267604     7.0056        1.2917
          born in US               264      12962570       1692897    54.1769        3.5153

          Total                    334      14638749       1727388    61.1825        3.6318
-------------------------------------------------------------------------------------------
Total     born elsewhere           102       2322871        384346     9.7084        1.8069
          born in US               469      21603490       2586103    90.2916        1.8069

          Total                    571      23926362       2606988    100.000
-------------------------------------------------------------------------------------------


                Column 1 Risk Estimates

                        Standard
                Risk       Error   95% Confidence Limits

Row 1         0.0696      0.0211       0.0252     0.1141
Row 2         0.1145      0.0204       0.0715     0.1575
Total         0.0971      0.0181       0.0590     0.1352

Difference   -0.0449      0.0210      -0.0893    -0.0005

             Difference is (Row 1 - Row 2)

                   Sample Size = 5549


                Column 2 Risk Estimates

                        Standard
                Risk       Error   95% Confidence Limits

Row 1         0.9304      0.0211       0.8859     0.9748
Row 2         0.8855      0.0204       0.8425     0.9285
Total         0.9029      0.0181       0.8648     0.9410

Difference    0.0449      0.0210       0.0005     0.0893

             Difference is (Row 1 - Row 2)

                   Sample Size = 5549


           Odds Ratio and Relative Risks (Row1/Row2)

                            Estimate       95% Confidence Limits

Odds Ratio                    0.5788        0.3154        1.0619
Column 1 Relative Risk        0.6081        0.3466        1.0669
Column 2 Relative Risk        1.0507        1.0006        1.1033

                       Sample Size = 5549


                                Table of female by DMDBORN4
                            Controlling for DMDMARTL=separated

                                            Weighted    Std Dev of               Std Err of
female    DMDBORN4           Frequency     Frequency      Wgt Freq    Percent       Percent
-------------------------------------------------------------------------------------------
male      born elsewhere            36        798720        216266    14.8822        3.5300
          born in US                43       1220641        289387    22.7437        4.3090

          Total                     79       2019361        376397    37.6260        4.7175
-------------------------------------------------------------------------------------------
female    born elsewhere            54       1387134        213988    25.8459        3.9489
          born in US                71       1960437        364269    36.5281        4.9696

          Total                    125       3347571        413911    62.3740        4.7175
-------------------------------------------------------------------------------------------
Total     born elsewhere            90       2185854        249474    40.7282        3.2853
          born in US               114       3181078        458084    59.2718        3.2853

          Total                    204       5366932        614868    100.000
-------------------------------------------------------------------------------------------

                Column 1 Risk Estimates

                        Standard
                Risk       Error   95% Confidence Limits

Row 1         0.3955      0.0822       0.2222     0.5689
Row 2         0.4144      0.0599       0.2880     0.5408
Total         0.4073      0.0329       0.3380     0.4766

Difference   -0.0188      0.1254      -0.2835     0.2458

             Difference is (Row 1 - Row 2)

                   Sample Size = 5549


                Column 2 Risk Estimates

                        Standard
                Risk       Error   95% Confidence Limits

Row 1         0.6045      0.0822       0.4311     0.7778
Row 2         0.5856      0.0599       0.4592     0.7120
Total         0.5927      0.0329       0.5234     0.6620

Difference    0.0188      0.1254      -0.2458     0.2835

             Difference is (Row 1 - Row 2)

                   Sample Size = 5549


           Odds Ratio and Relative Risks (Row1/Row2)

                            Estimate       95% Confidence Limits

Odds Ratio                    0.9248        0.3077        2.7793
Column 1 Relative Risk        0.9545        0.4948        1.8413
Column 2 Relative Risk        1.0322        0.6625        1.6082

                       Sample Size = 5549

                                Table of female by DMDBORN4
                          Controlling for DMDMARTL=never married

                                            Weighted    Std Dev of               Std Err of
female    DMDBORN4           Frequency     Frequency      Wgt Freq    Percent       Percent
-------------------------------------------------------------------------------------------
male      born elsewhere           150       3945129        570145     8.8807        1.1797
          born in US               482      21151252       2485409    47.6125        1.9035

          Total                    632      25096380       2750788    56.4931        1.9690
-------------------------------------------------------------------------------------------
female    born elsewhere           133       3316186        623580     7.4649        1.1880
          born in US               421      16011195       2027247    36.0420        2.3406

          Total                    554      19327382       2250319    43.5069        1.9690
-------------------------------------------------------------------------------------------
Total     born elsewhere           283       7261315       1072947    16.3456        2.0345
          born in US               903      37162447       4176512    83.6544        2.0345

          Total                   1186      44423762       4681954    100.000
-------------------------------------------------------------------------------------------


                Column 1 Risk Estimates

                        Standard
                Risk       Error   95% Confidence Limits

Row 1         0.1572      0.0196       0.1158     0.1986
Row 2         0.1716      0.0287       0.1110     0.2321
Total         0.1635      0.0203       0.1205     0.2064

Difference   -0.0144      0.0254      -0.0680     0.0393

             Difference is (Row 1 - Row 2)

                   Sample Size = 5549


                Column 2 Risk Estimates

                        Standard
                Risk       Error   95% Confidence Limits

Row 1         0.8428      0.0196       0.8014     0.8842
Row 2         0.8284      0.0287       0.7679     0.8890
Total         0.8365      0.0203       0.7936     0.8795

Difference    0.0144      0.0254      -0.0393     0.0680

             Difference is (Row 1 - Row 2)

                   Sample Size = 5549


           Odds Ratio and Relative Risks (Row1/Row2)

                            Estimate       95% Confidence Limits

Odds Ratio                    0.9006        0.6143        1.3202
Column 1 Relative Risk        0.9162        0.6665        1.2593
Column 2 Relative Risk        1.0174        0.9537        1.0853

                       Sample Size = 5549


                                Table of female by DMDBORN4
                       Controlling for DMDMARTL=living with partner

                                            Weighted    Std Dev of               Std Err of
female    DMDBORN4           Frequency     Frequency      Wgt Freq    Percent       Percent
-------------------------------------------------------------------------------------------
male      born elsewhere            69       2257394        449540    12.0866        1.9362
          born in US               168       7272128        876424    38.9368        2.4468

          Total                    237       9529522       1065259    51.0234        1.7000
-------------------------------------------------------------------------------------------
female    born elsewhere            64       2000636        413700    10.7119        2.0043
          born in US               139       7146604        832299    38.2647        2.5917

          Total                    203       9147240        910699    48.9766        1.7000
-------------------------------------------------------------------------------------------
Total     born elsewhere           133       4258030        799318    22.7985        3.5450
          born in US               307      14418732       1572309    77.2015        3.5450

          Total                    440      18676762       1874572    100.000
-------------------------------------------------------------------------------------------

                Column 1 Risk Estimates

                        Standard
                Risk       Error   95% Confidence Limits

Row 1         0.2369      0.0380       0.1567     0.3170
Row 2         0.2187      0.0414       0.1313     0.3061
Total         0.2280      0.0355       0.1532     0.3028

Difference    0.0182      0.0358      -0.0574     0.0937

             Difference is (Row 1 - Row 2)

                   Sample Size = 5549


                Column 2 Risk Estimates

                        Standard
                Risk       Error   95% Confidence Limits

Row 1         0.7631      0.0380       0.6830     0.8433
Row 2         0.7813      0.0414       0.6939     0.8687
Total         0.7720      0.0355       0.6972     0.8468

Difference   -0.0182      0.0358      -0.0937     0.0574

             Difference is (Row 1 - Row 2)

                   Sample Size = 5549


           Odds Ratio and Relative Risks (Row1/Row2)

                            Estimate       95% Confidence Limits

Odds Ratio                    1.1089        0.7191        1.7100
Column 1 Relative Risk        1.0831        0.7741        1.5155
Column 2 Relative Risk        0.9767        0.8859        1.0769

                       Sample Size = 5549
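
To see how the risks, relative risks and odds ratio relate to one another, the point estimates for the married table can be reproduced from its weighted frequencies. This is a minimal sketch: the four weighted frequencies are typed in from the output above, and the data set and variable names are made up for illustration. It reproduces only the point estimates; the standard errors and confidence limits depend on the sampling design and cannot be obtained this way.

```sas
* point estimates for the married table, computed by hand;
data orcheck;
  a = 11803069;   * male, born elsewhere;
  b = 46849089;   * male, born in US;
  c = 11167772;   * female, born elsewhere;
  d = 48962647;   * female, born in US;
  risk1 = a / (a + b);                    * column 1 risk for row 1;
  risk2 = c / (c + d);                    * column 1 risk for row 2;
  riskdiff = risk1 - risk2;               * row 1 minus row 2;
  relrisk1 = risk1 / risk2;               * column 1 relative risk;
  relrisk2 = (1 - risk1) / (1 - risk2);   * column 2 relative risk;
  oddsratio = (a * d) / (b * c);
run;

proc print data = orcheck noobs;
  var risk1 risk2 riskdiff relrisk1 relrisk2 oddsratio;
  format risk1 risk2 riskdiff relrisk1 relrisk2 oddsratio 8.4;
run;
```

The results should agree, up to rounding, with the Risk and Estimate columns shown for DMDMARTL=married.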

Analysis of subpopulations

Before we continue, we should pause to discuss the analysis of subpopulations. The analysis of subpopulations is one place where survey data and experimental data are quite different. If you have data from an experiment (or quasi-experiment), and you want to analyze the responses from, say, just the women, or just people over age 50, you can just delete the unwanted cases from the data set or use a by statement. Survey data are different. With survey data, you (almost) never get to delete any cases from the data set, even if you will never use them in any of your analyses. Because of the way the by statement works, you usually don’t use it with survey data either. Instead, SAS provides a domain statement in most survey procedures that allows you to correctly analyze subpopulations of your survey data. A domain and a subpopulation are the same thing. The domain statement is similar to a by statement in that you will get output for each level of the variable (or variables) listed on the statement. This means that you will often get more output than you want; you simply ignore the output for domains that are not of interest to you. Please note that there is no domain statement in proc surveyfreq; instead, you include the variables that you would have put on the domain statement on the tables statement.
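
For proc surveyfreq, the note above can be sketched as follows. This is only an illustration, assuming that females are the subpopulation of interest and that the distribution of marital status is wanted; it reuses the variables and formats from the earlier examples. Listing female first on the tables statement and requesting row percentages gives the percentages of dmdmartl within each level of female.

```sas
* treating female as the domain in proc surveyfreq;
proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
* the row percentages are the within-domain percentages;
tables female*dmdmartl / row;
format female fm. dmdmartl matsat.;
run;
```

If only the females are of interest, the rows for males are simply ignored; no cases are deleted from the data set.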

First, however, let’s take a second to see why deleting cases from a survey data set can be so problematic. If the data set is subset (meaning that observations not to be included in the subpopulation are deleted from the data set), two problems arise. First, the estimated number of elements in the population cannot be correctly calculated because some values are missing as you sum down the column of sampling weights. Second, the standard errors of the estimates cannot be calculated correctly. When the domain statement is used, only the cases in the subpopulation are used in the calculation of the estimate, but all cases are used in the calculation of the standard errors. For more information on this issue, please see Sampling Techniques, Third Edition by William G. Cochran (1977) and Small Area Estimation by J. N. K. Rao (2003).
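One common way to put this advice into practice, sketched here under stated assumptions, is to flag the subpopulation with an indicator variable and place that indicator on the domain statement, rather than deleting cases. The data set name nhanes2012_dom and the variable marriedf are made up for this illustration, and the coding (female = 1 for females, dmdmartl = 1 for married) is assumed from the formats and examples used in this seminar.

```sas
* sketch only: nhanes2012_dom and marriedf are hypothetical names;
data nhanes2012_dom;
set nhanes2012;
* 1 = married female, 0 = everyone else, including cases with missing marital status;
marriedf = (female = 1 and dmdmartl = 1);
run;

* every case is kept, so the full design is used for the standard errors;
proc surveymeans data = nhanes2012_dom;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain marriedf;
var pad630;
run;
```

If the coding assumptions hold, the row for marriedf = 1 should correspond to the female, married cell of the crossed-domain analysis shown later in this section. You would ignore the row for marriedf = 0.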

We will begin with an analysis that we have seen before.

* subpops;
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var pad630;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
PAD630              2054      139.887377        5.579060    128.116590 151.658164
---------------------------------------------------------------------------------

Now let’s add the domain statement. The format statement is not technically needed, but it is a nice way to more clearly label the output. If you were interested only in the mean for females, you would ignore the output for males.

proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain female;
var pad630;
format female fm.;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
PAD630              2054      139.887377        5.579060    128.116590 151.658164
---------------------------------------------------------------------------------


                                  Domain Analysis: female

                                                         Std Error
female    Variable               N            Mean         of Mean       95% CL for Mean
-------------------------------------------------------------------------------------------
male      PAD630              1136      155.627196        7.008380    140.840807 170.413584
female    PAD630               918      121.684284        5.352345    110.391824 132.976744
-------------------------------------------------------------------------------------------
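When the domain output gets long, one option is to send the domain table to a data set with ODS and work with that data set instead of the listing. The sketch below assumes that the ODS table holding the domain estimates is named Domain; dom_est is a made-up data set name.

```sas
* dom_est is a hypothetical data set name;
ods output domain = dom_est;
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain female;
var pad630;
format female fm.;
run;

* inspect the saved domain estimates before subsetting rows of interest;
proc print data = dom_est;
run;
```

Subsetting this output data set is safe because the estimates and standard errors have already been computed from the full data set.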

In this example, we include two variables on the domain statement. Notice that this gives the same results as running proc surveymeans twice, once with each of the variables on the domain statement.

proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain female dmdmartl;
var pad630;
format dmdmartl matsat. female fm.;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
PAD630              2054      139.887377        5.579060    128.116590 151.658164
---------------------------------------------------------------------------------


                                  Domain Analysis: female

                                                         Std Error
female    Variable               N            Mean         of Mean       95% CL for Mean
-------------------------------------------------------------------------------------------
male      PAD630              1136      155.627196        7.008380    140.840807 170.413584
female    PAD630               918      121.684284        5.352345    110.391824 132.976744
-------------------------------------------------------------------------------------------


                                  Domain Analysis: DMDMARTL

                                                              Std Error
DMDMARTL             Variable             N          Mean       of Mean     95% CL for Mean
----------------------------------------------------------------------------------------------
married              PAD630             841    139.439779      6.869524  124.946351 153.933208
widowed              PAD630              96    122.878708      9.530837  102.770399 142.987017
divorced             PAD630             166    164.023980     12.070142  138.558206 189.489754
separated            PAD630              70    195.067692     37.432563  116.091889 274.043496
never married        PAD630             363    140.786819     10.691710  118.229282 163.344355
living with partner  PAD630             169    165.483679     10.304480  143.743126 187.224232
----------------------------------------------------------------------------------------------
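The equivalence noted above can be checked with two separate runs, one per domain variable. This is only a sketch; the ods select statements are an addition that limits the listing to the domain table (assuming that table is named Domain), and each one applies only to the procedure step that follows it.

```sas
* first run: sex as the only domain variable;
ods select domain;
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain female;
var pad630;
format female fm.;
run;

* second run: marital status as the only domain variable;
ods select domain;
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain dmdmartl;
var pad630;
format dmdmartl matsat.;
run;
```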

In this example, we cross the domain variables, giving us each combination of the two variables.

proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain female*dmdmartl;
var pad630;
format dmdmartl matsat. female fm.;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
PAD630              2054      139.887377        5.579060    128.116590 151.658164
---------------------------------------------------------------------------------


                            Domain Analysis: female*DMDMARTL

                                                                                Std Error
female    DMDMARTL               Variable               N            Mean         of Mean
-----------------------------------------------------------------------------------------
male      married                PAD630               494      156.694056        8.765612
          widowed                PAD630                23      132.611146       20.775755
          divorced               PAD630                77      209.422973       27.286194
          separated              PAD630                32      187.968526       38.861059
          never married          PAD630               214      146.423066       12.353287
          living with partner    PAD630               102      201.422030       13.418949
female    married                PAD630               347      118.625504        9.590405
          widowed                PAD630                73      120.956913       11.104393
          divorced               PAD630                89      124.634331       15.505908
          separated              PAD630                38      200.933304       52.499861
-----------------------------------------------------------------------------------------


                 Domain Analysis: female*DMDMARTL

female    DMDMARTL               Variable       95% CL for Mean
------------------------------------------------------------------
male      married                PAD630      138.200232 175.187881
          widowed                PAD630       88.778134 176.444159
          divorced               PAD630      151.854136 266.991811
          separated              PAD630      105.978858 269.958193
          never married          PAD630      120.359908 172.486223
          living with partner    PAD630      173.110522 229.733538
female    married                PAD630       98.391518 138.859490
          widowed                PAD630       97.528692 144.385134
          divorced               PAD630       91.919726 157.348937
          separated              PAD630       90.168280 311.698328
------------------------------------------------------------------

                            Domain Analysis: female*DMDMARTL

                                                                                Std Error
female    DMDMARTL               Variable               N            Mean         of Mean
-----------------------------------------------------------------------------------------
female    never married          PAD630               149      131.709083       13.305583
          living with partner    PAD630                67      122.573635       14.885114
-----------------------------------------------------------------------------------------


                 Domain Analysis: female*DMDMARTL

female    DMDMARTL               Variable       95% CL for Mean
------------------------------------------------------------------
female    never married          PAD630      103.636758 159.781409
          living with partner    PAD630       91.168789 153.978481
------------------------------------------------------------------

In this example, we cross three variables on the domain statement.

proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain female*dmdmartl*dmdeduc2;
var pad630;
format dmdmartl matsat. female fm. dmdeduc2 edu.;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
PAD630              2054      139.887377        5.579060    128.116590 151.658164
---------------------------------------------------------------------------------



                            Domain Analysis: female*DMDMARTL*DMDEDUC2

female   DMDMARTL              DMDEDUC2                    Variable              N           Mean
-------------------------------------------------------------------------------------------------
male     married               less than 9th grade         PAD630               44     175.262504
                               no hs diploma               PAD630               68     191.892326
                               hs grad or GED              PAD630              111     165.884895
                               some college or AA degree   PAD630              166     155.183990
-------------------------------------------------------------------------------------------------

                    Domain Analysis: female*DMDMARTL*DMDEDUC2

                                                                         Std Error
female   DMDMARTL              DMDEDUC2                    Variable        of Mean
----------------------------------------------------------------------------------
male     married               less than 9th grade         PAD630        28.123585
                               no hs diploma               PAD630        18.189105
                               hs grad or GED              PAD630         9.758340
                               some college or AA degree   PAD630        13.216380
----------------------------------------------------------------------------------


                         Domain Analysis: female*DMDMARTL*DMDEDUC2

female   DMDMARTL              DMDEDUC2                    Variable      95% CL for Mean
-------------------------------------------------------------------------------------------
male     married               less than 9th grade         PAD630     115.926926 234.598081
                               no hs diploma               PAD630     153.516670 230.267982
                               hs grad or GED              PAD630     145.296598 186.473193
                               some college or AA degree   PAD630     127.299865 183.068115
-------------------------------------------------------------------------------------------

The SURVEYMEANS Procedure


                            Domain Analysis: female*DMDMARTL*DMDEDUC2

female   DMDMARTL              DMDEDUC2                    Variable              N           Mean
-------------------------------------------------------------------------------------------------
male     married               college grad or above       PAD630              105     133.266995
         widowed               less than 9th grade         PAD630                4     187.343598
                               no hs diploma               PAD630                4     167.204010
                               hs grad or GED              PAD630                2     103.256927
                               some college or AA degree   PAD630                9     125.939470
                               college grad or above       PAD630                4     123.887738
         divorced              less than 9th grade         PAD630                0              .
                               no hs diploma               PAD630               13     238.105548
                               hs grad or GED              PAD630               28     159.620092
                               some college or AA degree   PAD630               28     271.147197
                               college grad or above       PAD630                8     181.383414
         separated             less than 9th grade         PAD630                6     205.133253
                               no hs diploma               PAD630                7     199.085099
                               hs grad or GED              PAD630                9     279.838117
                               some college or AA degree   PAD630                8      81.121846
                               college grad or above       PAD630                2     237.326057
         never married         less than 9th grade         PAD630                8     221.332123
                               no hs diploma               PAD630               24     260.453605
                               hs grad or GED              PAD630               53     169.780030
                               some college or AA degree   PAD630               93     125.941216
                               college grad or above       PAD630               36     104.576598
         living with partner   less than 9th grade         PAD630                9     188.915139
                               no hs diploma               PAD630               21     257.398122
                               hs grad or GED              PAD630               28     215.157719
                               some college or AA degree   PAD630               36     198.181294
                               college grad or above       PAD630                8     105.547477
female   married               less than 9th grade         PAD630               14     123.187262
                               no hs diploma               PAD630               36     143.361616
                               hs grad or GED              PAD630               60     111.808026
                               some college or AA degree   PAD630              121     146.288791
                               college grad or above       PAD630              116      88.919333
         widowed               less than 9th grade         PAD630                5     176.095932
                               no hs diploma               PAD630               17      90.605348
                               hs grad or GED              PAD630               15     110.480512
                               some college or AA degree   PAD630               27     101.838525
                               college grad or above       PAD630                9     260.996206
         divorced              less than 9th grade         PAD630                3      54.288869
                               no hs diploma               PAD630               10     163.202217
                               hs grad or GED              PAD630               23     145.507378
                               some college or AA degree   PAD630               39     129.588332
                               college grad or above       PAD630               14      76.006626
         separated             less than 9th grade         PAD630                4     169.359877
                               no hs diploma               PAD630               10     165.848237
                               hs grad or GED              PAD630               10     302.022494
                               some college or AA degree   PAD630               10     142.532187
-------------------------------------------------------------------------------------------------

The SURVEYMEANS Procedure

                    Domain Analysis: female*DMDMARTL*DMDEDUC2

                                                                         Std Error
female   DMDMARTL              DMDEDUC2                    Variable        of Mean
----------------------------------------------------------------------------------
male     married               college grad or above       PAD630        10.848793
         widowed               less than 9th grade         PAD630        62.966878
                               no hs diploma               PAD630        14.100541
                               hs grad or GED              PAD630        20.074215
                               some college or AA degree   PAD630        35.754504
                               college grad or above       PAD630        52.934045
         divorced              less than 9th grade         PAD630                .
                               no hs diploma               PAD630        29.710731
                               hs grad or GED              PAD630        17.529990
                               some college or AA degree   PAD630        56.919768
                               college grad or above       PAD630        33.002598
         separated             less than 9th grade         PAD630        35.534816
                               no hs diploma               PAD630        38.033471
                               hs grad or GED              PAD630        48.201023
                               some college or AA degree   PAD630        29.502335
                               college grad or above       PAD630        84.810682
         never married         less than 9th grade         PAD630        60.764206
                               no hs diploma               PAD630        41.069041
                               hs grad or GED              PAD630        18.531657
                               some college or AA degree   PAD630        11.381655
                               college grad or above       PAD630        26.159027
         living with partner   less than 9th grade         PAD630        71.240615
                               no hs diploma               PAD630        43.612166
                               hs grad or GED              PAD630        24.033974
                               some college or AA degree   PAD630        25.916937
                               college grad or above       PAD630         7.580480
female   married               less than 9th grade         PAD630        33.994125
                               no hs diploma               PAD630        14.005356
                               hs grad or GED              PAD630        12.927725
                               some college or AA degree   PAD630        22.892464
                               college grad or above       PAD630         8.306940
         widowed               less than 9th grade         PAD630        72.509645
                               no hs diploma               PAD630        17.212933
                               hs grad or GED              PAD630        37.211889
                               some college or AA degree   PAD630        11.742558
                               college grad or above       PAD630        63.187041
         divorced              less than 9th grade         PAD630        15.497090
                               no hs diploma               PAD630        47.287308
                               hs grad or GED              PAD630        25.434714
                               some college or AA degree   PAD630        28.490532
                               college grad or above       PAD630        11.322661
         separated             less than 9th grade         PAD630        80.576093
                               no hs diploma               PAD630        46.573934
                               hs grad or GED              PAD630       117.298227
                               some college or AA degree   PAD630        26.011047
----------------------------------------------------------------------------------

The SURVEYMEANS Procedure


                         Domain Analysis: female*DMDMARTL*DMDEDUC2

female   DMDMARTL              DMDEDUC2                    Variable      95% CL for Mean
-------------------------------------------------------------------------------------------
male     married               college grad or above       PAD630     110.378042 156.155948
         widowed               less than 9th grade         PAD630      54.495098 320.192099
                               no hs diploma               PAD630     137.454469 196.953551
                               hs grad or GED              PAD630      60.904035 145.609819
                               some college or AA degree   PAD630      50.504061 201.374879
                               college grad or above       PAD630      12.206665 235.568810
         divorced              less than 9th grade         PAD630              .          .
                               no hs diploma               PAD630     175.421385 300.789710
                               hs grad or GED              PAD630     122.635045 196.605139
                               some college or AA degree   PAD630     151.056984 391.237409
                               college grad or above       PAD630     111.754017 251.012810
         separated             less than 9th grade         PAD630     130.161344 280.105162
                               no hs diploma               PAD630     118.841490 279.328709
                               hs grad or GED              PAD630     178.142847 381.533387
                               some college or AA degree   PAD630      18.877360 143.366332
                               college grad or above       PAD630      58.391158 416.260955
         never married         less than 9th grade         PAD630      93.130856 349.533391
                               no hs diploma               PAD630     173.805502 347.101708
                               hs grad or GED              PAD630     130.681652 208.878407
                               some college or AA degree   PAD630     101.928024 149.954409
                               college grad or above       PAD630      49.385875 159.767320
         living with partner   less than 9th grade         PAD630      38.610579 339.219699
                               no hs diploma               PAD630     165.384495 349.411749
                               hs grad or GED              PAD630     164.450467 265.864972
                               some college or AA degree   PAD630     143.501336 252.861252
                               college grad or above       PAD630      89.554062 121.540892
female   married               less than 9th grade         PAD630      51.465926 194.908597
                               no hs diploma               PAD630     113.812897 172.910335
                               hs grad or GED              PAD630      84.532910 139.083141
                               some college or AA degree   PAD630      97.989913 194.587668
                               college grad or above       PAD630      71.393222 106.445445
         widowed               less than 9th grade         PAD630      23.113955 329.077910
                               no hs diploma               PAD630      54.289234 126.921462
                               hs grad or GED              PAD630      31.970289 188.990735
                               some college or AA degree   PAD630      77.063892 126.613157
                               college grad or above       PAD630     127.683203 394.309208
         divorced              less than 9th grade         PAD630      21.592867  86.984872
                               no hs diploma               PAD630      63.434717 262.969717
                               hs grad or GED              PAD630      91.844821 199.169934
                               some college or AA degree   PAD630      69.478564 189.698101
                               college grad or above       PAD630      52.117900  99.895353
         separated             less than 9th grade         PAD630      -0.640820 339.360573
                               no hs diploma               PAD630      67.585825 264.110649
                               hs grad or GED              PAD630      54.544868 549.500120
                               some college or AA degree   PAD630      87.653676 197.410699
-------------------------------------------------------------------------------------------

The SURVEYMEANS Procedure


                            Domain Analysis: female*DMDMARTL*DMDEDUC2

female   DMDMARTL              DMDEDUC2                    Variable              N           Mean
-------------------------------------------------------------------------------------------------
female   separated             college grad or above       PAD630                4     117.921631
         never married         less than 9th grade         PAD630                4     309.991244
                               no hs diploma               PAD630               10     166.333455
                               hs grad or GED              PAD630               19     190.285276
                               some college or AA degree   PAD630               79     109.549232
                               college grad or above       PAD630               37     126.519184
         living with partner   less than 9th grade         PAD630                2     120.000000
                               no hs diploma               PAD630                9     144.995045
                               hs grad or GED              PAD630               17      98.815434
                               some college or AA degree   PAD630               26     152.592251
                               college grad or above       PAD630               13      89.011298
-------------------------------------------------------------------------------------------------

                    Domain Analysis: female*DMDMARTL*DMDEDUC2

                                                                         Std Error
female   DMDMARTL              DMDEDUC2                    Variable        of Mean
----------------------------------------------------------------------------------
female   separated             college grad or above       PAD630        34.795950
         never married         less than 9th grade         PAD630        93.152302
                               no hs diploma               PAD630        46.458980
                               hs grad or GED              PAD630        55.647515
                               some college or AA degree   PAD630        15.629741
                               college grad or above       PAD630        19.292689
         living with partner   less than 9th grade         PAD630                0
                               no hs diploma               PAD630        24.397178
                               hs grad or GED              PAD630        21.160766
                               some college or AA degree   PAD630        23.746801
                               college grad or above       PAD630        23.175916
----------------------------------------------------------------------------------


                         Domain Analysis: female*DMDMARTL*DMDEDUC2

female   DMDMARTL              DMDEDUC2                    Variable      95% CL for Mean
-------------------------------------------------------------------------------------------
female   separated             college grad or above       PAD630      44.508594 191.334667
         never married         less than 9th grade         PAD630     113.457066 506.525422
                               no hs diploma               PAD630      68.313575 264.353336
                               hs grad or GED              PAD630      72.879282 307.691269
                               some college or AA degree   PAD630      76.573361 142.525104
                               college grad or above       PAD630      85.815170 167.223199
         living with partner   less than 9th grade         PAD630     120.000000 120.000000
                               no hs diploma               PAD630      93.521499 196.468590
                               hs grad or GED              PAD630      54.170121 143.460748
-------------------------------------------------------------------------------------------

The SURVEYMEANS Procedure


                         Domain Analysis: female*DMDMARTL*DMDEDUC2

female   DMDMARTL              DMDEDUC2                    Variable      95% CL for Mean
-------------------------------------------------------------------------------------------
female   living with partner   some college or AA degree   PAD630     102.490879 202.693622
                               college grad or above       PAD630      40.114391 137.908206
-------------------------------------------------------------------------------------------

By using proc print, we can see that only two cases have a valid value for pad630 in the subpopulation of females who are living with a partner and have less than a ninth-grade education, and both of those values are 120. Because the two values are identical, there is no variability from which to estimate a standard error, which is why the standard error is reported as 0 and both confidence limits equal 120.

proc print data = nhanes2012;
var pad630;
where female = 1 and dmdmartl = 6 and dmdeduc2 = 1;
run;

 Obs    PAD630

 344      120
 479        .
1339        .
1962        .
1987      120
2075        .
2148        .
2178        .
2631        .
2972        .
3148        .
4118        .
4595        .
6610        .
7064        .
7112        .
7214        .
7709        .
7829        .
8095        .
8479        .
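Rather than printing one cell at a time, a quick unweighted tabulation can show how many valid values of pad630 fall in every cell of the three-way domain. The sketch below uses proc means purely as a diagnostic; it ignores the survey design, so its output should not be used for estimation.

```sas
* unweighted check of cell sizes, for diagnostics only;
proc means data = nhanes2012 n nmiss;
class female dmdmartl dmdeduc2;
var pad630;
format dmdmartl matsat. female fm. dmdeduc2 edu.;
run;
```

Cells with very small values of N are the ones whose domain estimates and standard errors deserve the most caution.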

Now let’s say that you want to compare the means from two domains. In this example, we get the mean of pad630 for females and males.

proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain female;
var pad630;
format female fm.;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  14
Number of Clusters                31
Number of Observations          9756
Sum of Weights             306590681


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
PAD630              2054      139.887377        5.579060    128.116590 151.658164
---------------------------------------------------------------------------------


                                  Domain Analysis: female

                                                         Std Error
female    Variable               N            Mean         of Mean       95% CL for Mean
-------------------------------------------------------------------------------------------
male      PAD630              1136      155.627196        7.008380    140.840807 170.413584
female    PAD630               918      121.684284        5.352345    110.391824 132.976744
-------------------------------------------------------------------------------------------

There are a few different ways that you could compare 155.63 and 121.68. In this example, we will use proc surveyreg and the contrast statement. Notice that the output in the section titled Estimated Regression Coefficients is almost identical to the output of the proc surveymeans above.

proc surveyreg data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
class female;
model pad630 = female / noint solution vadjust = none;
contrast 'comparing males and females' female 1 -1;
format female fm.;
run;

The SURVEYREG Procedure

Regression Analysis for Dependent Variable PAD630

            Data Summary

Number of Observations           2054
Sum of Weights               88768571
Weighted Mean of PAD630     139.88738
Weighted Sum of PAD630     1.24176E10


         Design Summary

Number of Strata              14
Number of Clusters            31


      Fit Statistics

R-square            0.5545
Root MSE            126.37
Denominator DF          17


       Class Level Information

CLASS
Variable      Levels    Values

female             2    female male


       Tests of Model Effects

Effect    Num DF    F Value    Pr > F

Model          2     325.56    <.0001
female         2     325.56    <.0001

NOTE: The denominator degrees of freedom for the F tests is 17.


               Estimated Regression Coefficients

                                 Standard
Parameter          Estimate         Error    t Value    Pr > |t|

female female    121.684284    5.35234462      22.73      <.0001
female male      155.627196    7.00837974      22.21      <.0001

NOTE: The denominator degrees of freedom for the t tests is 17.

Regression Analysis for Dependent Variable PAD630

                  Analysis of Contrasts

Contrast                       Num DF    F Value    Pr > F

comparing males and females         1      31.67    <.0001

NOTE: The denominator degrees of freedom for the F tests is 17.
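
The contrast statement reports only an F test. If the signed difference and its standard error are also wanted, one option is to add an estimate statement to the same model. The sketch below assumes that the class levels are ordered female, male, as shown in the Class Level Information table, so the coefficients -1 1 request the male mean minus the female mean; the label is arbitrary.

proc surveyreg data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
class female;
model pad630 = female / noint solution vadjust = none;
* coefficients follow the class level order: female, male;
estimate 'male minus female' female -1 1;
format female fm.;
run;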

In this example, we do the same analysis using proc surveyreg and the lsmeans statement. This example is adapted from code on the SAS website here.

proc surveyreg data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
class female;
model pad630 = female / noint solution vadjust = none;
lsmeans female / diff;
format female fm.;
run;

The SURVEYREG Procedure

Regression Analysis for Dependent Variable PAD630

            Data Summary

Number of Observations           2054
Sum of Weights               88768571
Weighted Mean of PAD630     139.88738
Weighted Sum of PAD630     1.24176E10


         Design Summary

Number of Strata              14
Number of Clusters            31


      Fit Statistics

R-square            0.5545
Root MSE            126.37
Denominator DF          17


       Class Level Information

CLASS
Variable      Levels    Values

female             2    female male


       Tests of Model Effects

Effect    Num DF    F Value    Pr > F

Model          2     325.56    <.0001
female         2     325.56    <.0001

NOTE: The denominator degrees of freedom for the F tests is 17.


               Estimated Regression Coefficients

                                 Standard
Parameter          Estimate         Error    t Value    Pr > |t|

female female    121.684284    5.35234462      22.73      <.0001
female male      155.627196    7.00837974      22.21      <.0001

NOTE: The denominator degrees of freedom for the t tests is 17.

Regression Analysis for Dependent Variable PAD630

                  female Least Squares Means

                      Standard
female    Estimate       Error       DF    t Value    Pr > |t|

female      121.68      5.3523       17      22.73      <.0001
male        155.63      7.0084       17      22.21      <.0001


                Differences of female Least Squares Means

                                 Standard
female    _female    Estimate       Error       DF    t Value    Pr > |t|

female    male       -33.9429      6.0314       17      -5.63      <.0001

If you square the t-value from this analysis, you will get the F-value given in the previous proc surveyreg analysis. The estimate of -33.94 is simply the difference of the means, 121.68 – 155.63 (with a little rounding error).
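
These relationships are easy to check by hand. The data step below is only an arithmetic check using the values printed in the output above; the variable names are arbitrary. It recomputes the t-value from the estimate and its standard error, squares it, and takes the difference of the two domain means. The values written to the log should be approximately -5.63, 31.67 and -33.94.

data _null_;
* t-value is the estimated difference divided by its standard error;
t = -33.9429 / 6.0314;
* squaring the t-value gives the F-value from the contrast statement;
f = t**2;
* difference of the two domain means, female minus male;
diff = 121.684284 - 155.627196;
put t= f= diff=;
run;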

Regression

Now we will look at a few examples of regression analyses. We will use proc surveyreg and proc surveylogistic. The variables in these examples were chosen because they were either continuous or categorical, not because of the data that they contain. In other words, the models shown here were not constructed to make substantive sense; rather, they were constructed to illustrate how certain things can be done. The variable pad630 is the number of minutes spent doing moderate-intensity activities on a typical work day.

proc surveyreg data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
class female;
model pad630 = female ridageyr / solution;
format female fm.;
run;

The SURVEYREG Procedure

Regression Analysis for Dependent Variable PAD630

            Data Summary

Number of Observations           2054
Sum of Weights               88768571
Weighted Mean of PAD630     139.88738
Weighted Sum of PAD630     1.24176E10


         Design Summary

Number of Strata              14
Number of Clusters            31


      Fit Statistics

R-Square           0.01933
Root MSE            126.29
Denominator DF          17


       Class Level Information

CLASS
Variable      Levels    Values

female             2    female male


         Tests of Model Effects

Effect       Num DF    F Value    Pr > F

Model             2      18.23    <.0001
Intercept         1     327.92    <.0001
female            1      30.26    <.0001
RIDAGEYR          1       4.97    0.0396

NOTE: The denominator degrees of freedom for the F tests is 17.


               Estimated Regression Coefficients

                                 Standard
Parameter          Estimate         Error    t Value    Pr > |t|

Intercept        167.289537    9.31899525      17.95      <.0001
female female    -33.125134    6.02167950      -5.50      <.0001
female male        0.000000    0.00000000        .         .
RIDAGEYR          -0.287604    0.12906252      -2.23      0.0396

NOTE: The degrees of freedom for the t tests is 17.
      Matrix X'WX is singular and a generalized inverse was used to solve the normal equations.
      Estimates are not unique.

There is a class statement in proc surveyreg (there isn’t one in proc reg), and, depending on the version of SAS/Stat that you are running, it has many of the options that are found on the class statements in most other SAS procedures. The default in SAS is to use the last category in sort order as the reference group; for an unformatted numeric variable, this is the highest-numbered category. Hence, without a format, the reference group for the variable female would be 1 (females), not 0 (males). In the previous example, the format fm. was applied, so the categories were sorted by their value labels, and male, which comes last alphabetically, was the reference group (it is the row with an estimate of 0). In the example below, the category coded 0 is used as the reference group for both predictor variables, female and hsq571 (have you donated blood in the last year?). Besides using the options on the class statement, you can change the reference group by using a format such that the group that you want to be the reference group has the value label that comes last alphabetically. Notice on the model statement that the “|” symbol was used. This tells SAS to include both the main effects and the interaction in the model.

proc surveyreg data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
class female (ref = first) hsq571 (ref = '0');
* class female hsq571 / ref = first;
model pad630 = female|hsq571 ridageyr / solution;
run;

The SURVEYREG Procedure

Regression Analysis for Dependent Variable PAD630

            Data Summary

Number of Observations           1673
Sum of Weights               76183526
Weighted Mean of PAD630     145.83021
Weighted Sum of PAD630     1.11099E10


         Design Summary

Number of Strata              14
Number of Clusters            31


      Fit Statistics

R-Square           0.04574
Root MSE            126.20
Denominator DF          17


   Class Level Information

CLASS
Variable      Levels    Values

female             2    1 0
HSQ571             2    1 0


           Tests of Model Effects

Effect           Num DF    F Value    Pr > F

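Model                 4      27.13    <.0001
...


               Estimated Regression Coefficients

                                 Standard
Parameter          Estimate         Error    t Value    Pr > |t|

Intercept        209.155683    11.9883632      17.45      <.0001
...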
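
When a format is attached to a class variable, the ref= option takes the formatted value rather than the raw code. As a sketch, assuming that the fm. format used earlier labels the two categories male and female and that your version of SAS/Stat supports ref= on the class statement, the first regression in this section could be rerun with females as the reference group:

proc surveyreg data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
* with a format in effect, ref= names the value label, not the code;
class female (ref = 'female');
model pad630 = female ridageyr / solution;
format female fm.;
run;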

Below are a few examples of binary logistic regression. The variable paq665 asks if you do any moderate-intensity sports. A little data management is needed before we can run the logistic regression. Notice the options on the class statement and the model statement.

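* logistic regression;
data nhanes2012b;
set nhanes2012;
* age1 is an indicator for respondents older than 20;
age1 = 0;
if ridageyr > 20 then age1 = 1;
* subtract 1 so that the two response levels of paq665 are coded 0 and 1;
paq665 = paq665-1;
run;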
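
Before modeling, it may be worth confirming that the recoding did what was intended. One simple check (not part of the original program) is an unweighted frequency table of the new variables that also shows the missing values:

proc freq data = nhanes2012b;
* the missing option includes missing values as a category in the table;
tables paq665 age1 / missing;
run;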

* how you specify the reference group depends on whether or not you have a format;
* notice where the desc option goes;
proc surveylogistic data = nhanes2012b;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
class hsd010 (reference = '3') female (reference = 'male') / param = ref;
model paq665 (desc) = hsd010|female ridageyr;
format female fm.;
run;

The SURVEYLOGISTIC Procedure

                  Model Information

Data Set                      WORK.NHANES2012B
Response Variable             PAQ665
Number of Response Levels     2
Stratum Variable              SDMVSTRA
Number of Strata              14
Cluster Variable              SDMVPSU
Number of Clusters            31
Weight Variable               WTINT2YR
Model                         Binary Logit
Optimization Technique        Fisher's Scoring
Variance Adjustment           Degrees of Freedom (DF)


             Variance Estimation

Method                           Taylor Series
Variance Adjustment    Degrees of Freedom (DF)


Number of Observations Read        9756
Number of Observations Used        5890
Sum of Weights Read            3.0659E8
Sum of Weights Used            2.2591E8


                  Response Profile

 Ordered                      Total            Total
   Value       PAQ665     Frequency           Weight

       1            1          3347        117045323
       2            0          2543        108863614

Probability modeled is PAQ665=1.

NOTE: 3866 observations were deleted due to missing values for the response or explanatory
      variables.


           Class Level Information

Class      Value          Design Variables

HSD010     1           1      0      0      0
           2           0      1      0      0
           3           0      0      0      0
           4           0      0      1      0
           5           0      0      0      1

           Class Level Information

Class      Value          Design Variables

female     female      1
           male        0


                    Model Convergence Status

         Convergence criterion (GCONV=1E-8) satisfied.


         Model Fit Statistics

                             Intercept
              Intercept            and
Criterion          Only     Covariates

AIC           312879907      301730876
SC            312879914      301730950
-2 Log L      312879905      301730854


           Testing Global Null Hypothesis: BETA=0

Test                 F Value     Num DF     Den DF     Pr > F

Likelihood Ratio     1114905         10      Infty     <.0001
Score                  66.25         10         17     <.0001
Wald                   54.07         10         17     <.0001


                   Joint Tests

                               Wald
Effect             DF    Chi-Square    Pr > ChiSq

HSD010              4      107.3724        <.0001
female              1        0.2340        0.6286
HSD010*female       4        7.8696        0.0965
RIDAGEYR            1       19.3881        <.0001

NOTE: Under full-rank parameterizations, Type 3 effect tests are replaced by joint tests. The
      joint test for an effect is a test that all the parameters associated with that effect are
      zero. Such joint tests might not be equivalent to Type 3 effect tests under GLM
      parameterization.
                 Analysis of Maximum Likelihood Estimates

                                           Standard
Parameter                      Estimate       Error    t Value    Pr > |t|

Intercept                       -0.2185      0.1105      -1.98      0.0645
HSD010        1                 -0.3381      0.1390      -2.43      0.0263
HSD010        2                 -0.4097      0.1646      -2.49      0.0234
HSD010        4                  0.5232      0.1553       3.37      0.0036
HSD010        5                  1.4377      0.4666       3.08      0.0068
female        female             0.0682      0.1410       0.48      0.6348
HSD010*female 1      female     -0.4494      0.1744      -2.58      0.0196
HSD010*female 2      female     -0.1535      0.2124      -0.72      0.4796
HSD010*female 4      female     -0.1603      0.2996      -0.54      0.5996
HSD010*female 5      female     -0.1676      0.5184      -0.32      0.7503
RIDAGEYR                        0.00963     0.00219       4.40      0.0004

           NOTE: The degrees of freedom for the t tests is 17.


            Odds Ratio Estimates

               Point       95% Confidence
Effect      Estimate           Limits

RIDAGEYR       1.010       1.005       1.014

 NOTE: The degrees of freedom in computing
        the confidence limits is 17.


Association of Predicted Probabilities and Observed Responses

Percent Concordant       61.3    Somers' D    0.232
Percent Discordant       38.1    Gamma        0.234
Percent Tied              0.6    Tau-a        0.114
Pairs                 8511421    c            0.616
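
The odds ratio and confidence limits for ridageyr in the table above can be reproduced from the coefficient and its standard error. The data step below is only an arithmetic check with arbitrary variable names: it exponentiates the estimate and the endpoints of a t-based interval with 17 degrees of freedom. Using the rounded values printed in the output, it should give approximately 1.010, 1.005 and 1.014.

data _null_;
est = 0.00963;
se = 0.00219;
* critical value of the t distribution with 17 degrees of freedom;
tcrit = tinv(0.975, 17);
oddsratio = exp(est);
lower = exp(est - tcrit*se);
upper = exp(est + tcrit*se);
put oddsratio= 6.3 lower= 6.3 upper= 6.3;
run;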

We can use the expb option on the model statement to get the odds ratios. We can use the clodds option to get the confidence limits around the odds ratios. SAS will provide a generalized R-square, but not all statisticians agree that this is appropriate. The variable hsq470 is the number of days in the last 30 days that physical health was not good, and hsq480 is the number of days in the past 30 days that mental health was not good.

proc surveylogistic data = nhanes2012b;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
model paq665 (desc) = ridageyr hsq470 hsq480  / expb clodds rsquare;
run;

The SURVEYLOGISTIC Procedure

                  Model Information

Data Set                      WORK.NHANES2012B
Response Variable             PAQ665
Number of Response Levels     2
Stratum Variable              SDMVSTRA
Number of Strata              14
Cluster Variable              SDMVPSU
Number of Clusters            31
Weight Variable               WTINT2YR
Model                         Binary Logit
Optimization Technique        Fisher's Scoring
Variance Adjustment           Degrees of Freedom (DF)


             Variance Estimation

Method                           Taylor Series
Variance Adjustment    Degrees of Freedom (DF)


Number of Observations Read        9756
Number of Observations Used        5873
Sum of Weights Read            3.0659E8
Sum of Weights Used            2.2552E8


                  Response Profile

 Ordered                      Total            Total
   Value       PAQ665     Frequency           Weight

       1            1          3334        116723574
       2            0          2539        108793841

Probability modeled is PAQ665=1.

NOTE: 3883 observations were deleted due to missing values for the response or explanatory
      variables.


                    Model Convergence Status

         Convergence criterion (GCONV=1E-8) satisfied.

         Model Fit Statistics

                             Intercept
              Intercept            and
Criterion          Only     Covariates

AIC           312354637      306355580
SC            312354644      306355607
-2 Log L      312354635      306355572


R-Square    1.0000    Max-rescaled R-Square    1.0000


           Testing Global Null Hypothesis: BETA=0

Test                 F Value     Num DF     Den DF     Pr > F

Likelihood Ratio     1999688          3      Infty     <.0001
Score                  12.07          3         17     0.0002
Wald                   13.15          3         17     0.0001


              Analysis of Maximum Likelihood Estimates

                         Standard
Parameter    Estimate       Error    t Value    Pr > |t|    Exp(Est)

Intercept     -0.5190      0.1022      -5.08      <.0001       0.595
RIDAGEYR       0.0106     0.00222       4.77      0.0002       1.011
HSQ470         0.0296     0.00755       3.92      0.0011       1.030
HSQ480         0.0124     0.00289       4.30      0.0005       1.013

        NOTE: The degrees of freedom for the t tests is 17.


Association of Predicted Probabilities and Observed Responses

Percent Concordant       57.5    Somers' D    0.159
Percent Discordant       41.6    Gamma        0.161
Percent Tied              0.9    Tau-a        0.078
Pairs                 8465026    c            0.580


      Odds Ratio Estimates and t Confidence Intervals

Effect           Unit     Estimate     95% Confidence Limits

RIDAGEYR       1.0000        1.011        1.006        1.015
HSQ470         1.0000        1.030        1.014        1.047
HSQ480         1.0000        1.013        1.006        1.019

         NOTE: The degrees of freedom in computing
                the confidence limits is 17.
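
The Unit column in the table above shows that each odds ratio is for a one-unit change in the predictor. For a variable such as age, a larger change is often easier to interpret. As a sketch, the units statement can request the odds ratio for a different increment; the choice of 10 years here is only for illustration.

proc surveylogistic data = nhanes2012b;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
model paq665 (desc) = ridageyr hsq470 hsq480 / clodds;
* report the odds ratio for a 10-year change in age;
units ridageyr = 10;
run;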

Using proc glimmix

The following example is copied directly from the SAS website because I don’t have any good data to use for an example. Please see this page for this example and more information. Besides the comments that SAS makes below, there are a few things that I would like to point out. First, notice that you MUST supply two weight variables: a weight for level 1 and a weight for level 2. This is not an inconvenience of using SAS; rather, this is true of running any type of multilevel model in any statistical package. You need to do this because the level 1 sampling weights and the level 2 sampling weights enter into the multilevel model equation in different places. Having the two sampling weights is often a problem with public-use survey data sets, because the data are often not released with level 1 and level 2 weights. The other issue that you need to know about is the scaling of the level 1 sampling weights. This is not an issue in single-level models, but it is an issue in multilevel models. At this time, SAS does not have an option to scale the weights for you; rather, you need to do it yourself in a data step before you run proc glimmix. Please see Pfeffermann et al. (1998) and Rabe-Hesketh and Skrondal (2006) for more information.


proc glimmix data=dws method=quadrature empirical=classical;
   class id;
   model y = x1 x2 / dist=binomial link=probit obsweight=sw1 solution;
   random int  / subject=id weight=w2;
run;

To fit a weighted multilevel model, you should use METHOD=QUAD. The EMPIRICAL=CLASSICAL option in the PROC GLIMMIX statement instructs PROC GLIMMIX to compute the empirical (sandwich) variance estimators for the fixed effect and the variance. The empirical variance estimators are recommended for the inference about fixed effects and variance estimated by pseudo-likelihood.

Carle (2009) provides the SAS and Stata code for the two most common methods of scaling the level 1 weights in Appendix B of his paper Fitting Multilevel Models in Complex Survey Data with Design Weights: Recommendations. One method scales the level 1 weight to the sample size within each cluster; the other method scales the level 1 weight to the effective sample size. There is currently no recommendation about when to use either type of scaling; rather, the recommendation is to do a sensitivity analysis comparing both methods.
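
The two scaling methods can be sketched in a few lines. This is a minimal sketch, not Carle's code. It assumes a data set named dws with a cluster identifier id, as in the SAS example above, and an unscaled level 1 weight in a variable called w1; the names w1, dws_scaled, sw1_a and sw1_b are hypothetical. Method A rescales the level 1 weights so that they sum to the number of cases in each cluster. Method B rescales them so that they sum to the effective cluster sample size, which is the squared sum of the weights divided by the sum of the squared weights.

proc sql;
create table dws_scaled as
select *,
       /* method A: scaled weights sum to the cluster sample size */
       w1 * count(w1) / sum(w1) as sw1_a,
       /* method B: scaled weights sum to the effective cluster sample size */
       w1 * sum(w1) / sum(w1*w1) as sw1_b
from dws
group by id;
quit;

Either scaled variable could then be supplied to the obsweight= option on the model statement, with the model run once under each scaling as a sensitivity analysis.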

A few words about proc surveyimpute

The following is quoted from the SAS documentation: http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_surveyimpute_overview.htm

The SURVEYIMPUTE procedure imputes missing values of an item in a data set by replacing them with observed values from the same item. The principles by which the imputation is performed are particularly useful for survey data. PROC SURVEYIMPUTE also computes replicate weights (such as jackknife weights) that account for the imputation and that can be used for replication-based variance estimation for complex surveys. The procedure implements a fractional hot-deck imputation technique (Kim and Fuller 2004; Fuller 2009; Kim and Shao 2014) in addition to some traditional hot-deck imputation techniques (Andridge and Little 2010).

Nonresponse is a common problem in almost all surveys of human populations. Estimators that are based on survey data that include nonresponse can suffer from nonresponse bias if the nonrespondents are different from the respondents. Estimators that use complete cases (only the observed units) might also be less precise. Imputation techniques are important tools for reducing nonresponse bias and producing efficient estimators.

The main objectives of any imputation technique are to eliminate the nonresponse bias and to provide an imputed data set that results in consistent analyses conducted with the imputed data. In addition, a variance estimator must be available that accounts for both the sampling variance and the imputation variance. Imputation techniques use implicit or explicit models. Some model-based imputation techniques include multiple imputation, mean imputation, and regression imputation. For more information about multiple imputation in SAS/STAT, see Chapter 75: The MI Procedure, and Chapter 76: The MIANALYZE Procedure.

Imputation techniques that do not use explicit models include hot-deck imputation, cold-deck imputation, and fractional imputation. PROC SURVEYIMPUTE implements imputation techniques that do not use explicit models. It also produces replicate weights that can be used with any survey analysis procedure in SAS/STAT to estimate both the sampling variability and the imputation variability.

Hot-deck imputation is the most commonly used imputation technique for survey data. A donor is selected for a recipient unit, and the observed values of the donor are imputed for the missing items of the recipient. Although the imputation method is straightforward, the variance estimator that accounts for imputation variance might not be simple and is often ignored in practice. PROC SURVEYIMPUTE does not create imputation-adjusted replicate weights for hot-deck imputation.

Fractional hot-deck imputation (Kalton and Kish 1984; Fay 1996; Kim and Fuller 2004; Fuller and Kim 2005), also known as fractional imputation (FI), is a variation of hot-deck imputation in which one missing item for a recipient is imputed from multiple donors. Each donor donates a fraction of the original weight of the recipient such that the sum of the fractional weights from all the donors is equal to the original weight of the recipient. For fully efficient fractional imputation (FEFI), all observed values in an imputation cell are used as donors for a recipient unit in that cell (Kim and Fuller 2004).

The SURVEYIMPUTE procedure implements single and multiple hot-deck imputation and FEFI. Available donor selection techniques include simple random selection with or without replacement, probability proportional to weights selection (Rao and Shao 1992), and approximate Bayesian bootstrap selection (Rubin and Schenker 1986).

End quote.
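
To give a sense of what a call looks like, here is a minimal sketch of a hot-deck imputation. The data set mysurvey, the variable income, the imputation cell variable region and the output data set mysurvey_imputed are all hypothetical, and the option names are written as I understand them from the SAS documentation, so please check them against the documentation for your version of SAS/Stat before using them. The sketch selects one donor per recipient by simple random sampling with replacement within each imputation cell.

proc surveyimpute data = mysurvey method = hotdeck(selection = srswr) ndonors = 1 seed = 12345;
* variable with missing values to be imputed;
var income;
* donors are drawn from the same imputation cell as the recipient;
cells region;
output out = mysurvey_imputed;
run;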

A great deal of work has been done with respect to imputation methods for complex survey data. While this topic is beyond the scope of this workshop, interested readers may want to see

Andridge, Rebecca R. and Little, Roderick J. (2009). The Use of Sample Weights in Hot Deck Imputation. Journal of Official Statistics; 25(1): 21-36.

and

Bell, Bethany A., Kromrey, Jeffrey D., and Ferron, John M. (2009). Section on Survey Research Methods, JSM 2009.

For more information on using the NHANES data sets

There are helpful resources for learning how to analyze the NHANES data sets correctly. One is a listserv at http://www.cdc.gov/nchs/nhanes/nhanes_listserv.htm . There are also online tutorials at http://www.cdc.gov/nchs/tutorials/index.htm .

References

Applied Survey Data Analysis by Steven G. Heeringa, Brady T. West, and Patricia A. Berglund

Analysis of Health Surveys by Edward L. Korn and Barry I. Graubard

Sampling of Populations: Methods and Applications, Fourth Edition by Paul Levy and Stanley Lemeshow

Analysis of Survey Data Edited by R. L. Chambers and C. J. Skinner

Sampling Techniques, Third Edition by William G. Cochran

Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., and Rasbash, J. (1998), Weighting for Unequal Selection Probabilities in Multilevel Models, Journal of the Royal Statistical Society, Series B, 60, 23-40.

Rabe-Hesketh, S. and Skrondal, A. (2006), Multilevel Modelling of Complex Survey Data, Journal of the Royal Statistical Society, Series A, 169, 805-827.

Carle, Adam C. (2009). Fitting Multilevel Models in Complex Survey Data with Design Weights: Recommendations. BMC Medical Research Methodology; 9(49).

Quartagno, M., Carpenter, R., and Goldstein, H. (2019). Multiple Imputation with Survey Weights: A Multilevel Approach. Journal of Survey Statistics and Methodology, Volume 8, Issue 5, November 2020, Pages 965–989, https://doi.org/10.1093/jssam/smz036