The purpose of this workshop is to explore some issues in the analysis of survey data using SAS 9.4 and SAS/STAT 15.2. Most of the code shown in this seminar will work in earlier versions of SAS and SAS/STAT. To find out what version of SAS and SAS/STAT you are running, open SAS and look at the information in the log file.
NOTE: SAS (r) Proprietary Software 9.4 (TS1M7)
NOTE: Updated analytical products:
SAS/STAT 15.2
SAS/ETS 15.2
SAS/OR 15.2
SAS/IML 15.2
SAS/QC 15.2
There are seven survey procedures.
proc surveyselect: This procedure can be used to select a sample from a dataset.
proc surveyimpute: This procedure can be used to do single imputations on a survey dataset.
proc surveymeans: This procedure can be used to obtain weighted descriptive statistics for continuous variables. This procedure can produce graphs.
proc surveyfreq: This procedure can be used to run weighted one-way and multi-way crosstabulations. This procedure can produce graphs.
proc surveyregress: This procedure can be used to run weighted OLS regressions.
proc surveylogistic: This procedure can be used to run weighted logistic, ordinal, multinomial and probit regressions.
proc surveyphreg: This procedure can be used to run weighted proportional hazards regression.
We will also briefly discuss proc glimmix.
proc glimmix: This procedure will allow for sampling weights, so it can be used to run weighted multilevel models. This procedure does not have a strata, cluster or a domain statement, and it does not allow for replicate weights. It requires that a sampling weight be specified at each level of the model.
Why do we need survey data analysis software?
Regular statistical software (that is not designed for survey data) analyzes data as if the data were collected using simple random sampling. For experimental and quasi-experimental designs, this is exactly what we want. However, very few surveys use a simple random sample to collect data. Not only is it nearly impossible to do so, but it is not as efficient (either financially or statistically) as other sampling methods. When any sampling method other than simple random sampling is used, we usually need to use survey data analysis software to take into account the differences between the design that was used to collect the data and simple random sampling. This is because the sampling design affects both the calculation of the point estimates and the standard errors of those estimates. If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, both the point estimates and their standard errors will likely be calculated incorrectly. The sampling weight will affect the calculation of the point estimate, and the stratification and/or clustering will affect the calculation of the standard errors. Ignoring the clustering will likely lead to standard errors that are underestimated, possibly leading to results that seem to be statistically significant when, in fact, they are not. The difference in point estimates and standard errors obtained using non-survey software and survey software with the design properly specified will vary from data set to data set, and even between analyses using the same data set. While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be.
Sampling designs
Most people do not conduct their own surveys. Rather, they use survey data that some agency or company collected and made available to the public. The documentation must be read carefully to find out what kind of sampling design was used to collect the data. This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs. Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.
Below are some common features of many sampling designs.
Sampling weights: There are several types of weights that can be associated with a survey. Perhaps the most common is the sampling weight. A sampling weight is a probability weight that has had one or more adjustments made to it. Both a sampling weight and a probability weight are used to weight the sample back to the population from which the sample was drawn. By definition, a probability weight is the inverse of the probability of being included in the sample due to the sampling design (except for a certainty PSU, see below). The probability weight, called a pweight in Stata, is calculated as N/n, where N = the number of elements in the population and n = the number of elements in the sample. For example, if a population has 10 elements and 3 are sampled at random with replacement, then the probability weight would be 10/3 = 3.33. In a two-stage design, the probability weight is calculated as (1/f1)(1/f2), where f1 and f2 are the sampling fractions at the first and second stages; that is, the inverse of the sampling fraction for the first stage is multiplied by the inverse of the sampling fraction for the second stage. Under many sampling plans, the sum of the probability weights will equal the population total.
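The arithmetic behind a probability weight can be sketched in a short data step. The one-stage numbers come from the example above; the two-stage sampling fractions (0.10 and 0.25) are made up purely for illustration, and the variable names are hypothetical.

```sas
* sketch: probability weights as inverse sampling fractions;
data pweight_demo;
  pop_size  = 10;                      /* N: elements in the population */
  samp_size = 3;                       /* n: elements in the sample */
  pweight   = pop_size / samp_size;    /* N/n = 10/3 */

  /* hypothetical two-stage design: both fractions are invented */
  f1 = 0.10;                           /* stage 1: fraction of PSUs sampled */
  f2 = 0.25;                           /* stage 2: fraction of elements sampled within a PSU */
  pweight2 = (1 / f1) * (1 / f2);      /* inverse fractions multiplied together */
run;

proc print data = pweight_demo noobs;
  format pweight 8.2;
run;
```

Note that SAS variable names are not case-sensitive, so N and n cannot be used as two different variables; that is why the sketch uses pop_size and samp_size.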
While many textbooks will end their discussion of probability weights here, this definition does not fully describe the sampling weights that are included with actual survey data sets. Rather, the sampling weight, which is sometimes called a “final weight,” starts with the inverse of the sampling fraction, but then incorporates several other values, such as corrections for unit non-response, errors in the sampling frame (sometimes called non-coverage), and poststratification. Because these other values are included in the sampling weight that is included with the data set, it is often inadvisable to modify the sampling weights, such as trying to standardize them for a particular variable, e.g., age.
PSU: This is the primary sampling unit. This is the first unit that is sampled in the design. For example, school districts from California may be sampled and then schools within districts may be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU. One does not need to use the same sampling method at all levels of sampling. For example, probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts). In the case of a simple random sample, the PSUs and the elementary units are the same. In general, accounting for the clustering in the data (i.e., using the PSUs), will increase the standard errors of the point estimates. Conversely, ignoring the PSUs will tend to yield standard errors that are too small, leading to false positives when doing significance tests.
Strata: Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race or SES. Each element in the population must belong to one, and only one, stratum. Once the strata have been defined, samples are taken from each stratum as if it were independent of all of the other strata. For example, if a sample is to be stratified on gender, men and women would be sampled independently of one another. This means that the probability weights for men will likely be different from the probability weights for women. In most cases, you need to have two or more PSUs in each stratum. The purpose of stratification is to reduce the standard error of the estimates, and stratification works most effectively when the variance of the dependent variable is smaller within the strata than in the sample as a whole.
FPC: This is the finite population correction. This is used when the sampling fraction (the number of elements or respondents sampled relative to the population) becomes large. The FPC is used in the calculation of the standard error of the estimate. If the value of the FPC is close to 1, it will have little impact and can be safely ignored. In some survey data analysis programs, such as SUDAAN, this information will be needed if you specify that the data were collected without replacement (see below for a definition of “without replacement”). The formula for calculating the FPC is $\sqrt{(N-n)/(N-1)}$, where N is the number of elements in the population and n is the number of elements in the sample. To see the impact of the FPC for samples of various proportions, suppose that you had a population of 10,000 elements.
Sample size (n)     FPC
              1  1.0000
             10   .9995
            100   .9950
            500   .9747
           1000   .9487
           5000   .7071
           9000   .3162
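The FPC values for a population of 10,000 can be reproduced with a short data step. This is a minimal sketch; the data set and variable names are hypothetical.

```sas
* sketch: finite population correction for a population of 10,000;
data fpc_demo;
  pop_size = 10000;
  do samp_size = 1, 10, 100, 500, 1000, 5000, 9000;
    /* square root of (N - n) / (N - 1) */
    fpc = sqrt((pop_size - samp_size) / (pop_size - 1));
    output;
  end;
run;

proc print data = fpc_demo noobs;
  var samp_size fpc;
  format fpc 6.4;
run;
```

The FPC stays close to 1 until the sample becomes a sizable share of the population, which is why it is often ignored for large national surveys.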
Replicate weights: Replicate weights are a series of weight variables that are used to correct the standard errors for the sampling plan. They serve the same function as the PSU and strata variables (which are used with Taylor series linearization) to correct the standard errors of the estimates for the sampling design. Many public use data sets are now being released with replicate weights instead of PSUs and strata in an effort to more securely protect the identity of the respondents. In theory, the same standard errors will be obtained using either the PSU and strata or the replicate weights. There are different ways of creating replicate weights; the method used is determined by the sampling plan. The most common are balanced repeated replication (BRR) and jackknife replicate weights. You will need to read the documentation for the survey data set carefully to learn what type of replicate weight is included in the data set; specifying the wrong type of replicate weight will likely lead to incorrect standard errors. For more information on replicate weights, please see Stata Library: Replicate Weights and Appendix D of the WesVar Manual by Westat, Inc. Several statistical packages, including Stata, SAS, SUDAAN, WesVar and R, allow the use of replicate weights.
Consequences of not using the design elements
Sampling design elements include the sampling weights, post-stratification weights (if provided), PSUs, strata, and replicate weights. Rarely are all of these elements included in a single public-use data set. However, ignoring the design elements that are included can often lead to inaccurate point estimates and/or inaccurate standard errors.
Sampling with and without replacement
Most samples collected in the real world are collected “without replacement”. This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample. Many of the calculations change depending on whether a sample is collected with or without replacement. Hence, programs like SUDAAN request that you specify whether a survey sampling design was implemented with or without replacement, and an FPC is used if sampling without replacement is used, even if the value of the FPC is very close to one.
Examples
For the examples in this workshop, we will use the data set from NHANES 2011-2012. The data set and documentation can be downloaded from the NHANES web site. The data files can be downloaded as SAS transport (.xpt) files.
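A SAS transport file can be read with the xport engine on a libname statement. The sketch below assumes that the 2011-2012 demographics file, DEMO_G.XPT, has been saved to a hypothetical folder; change the path to wherever the file was downloaded. How the workshop data set nhanes2012 was assembled from the individual NHANES files is not shown here.

```sas
* assumption: DEMO_G.XPT was downloaded to the hypothetical folder below;
libname xptin xport "C:\nhanes\DEMO_G.XPT";

* copy every member of the transport file into the WORK library;
proc copy in = xptin out = work;
run;

libname xptin clear;
```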
Reading the documentation
The first step in analyzing any survey data set is to read the documentation. With many of the public use data sets, the documentation can be quite extensive and sometimes even intimidating. Instead of trying to read the documentation “cover to cover”, there are some parts you will want to focus on. First, read the Introduction. This is usually an “easy read” and will orient you to the survey. There is usually a section or chapter called something like “Sample Design and Analysis Guidelines”, “Variance Estimation”, etc. This is the part that tells you about the design elements included with the survey and how to use them. Some even give example code. If multiple sampling weights have been included in the data set, there will be some instruction about when to use which one. If there is a section or chapter on missing data or imputation, please read that. This will tell you how missing data were handled. You should also read any documentation regarding the specific variables that you intend to use. As we will see a little later on, we will need to look at the documentation to get the value labels for the variables. This is especially important because some of the values are actually missing data codes, and you need to do something so that SAS doesn’t treat those as valid values (or you will get some very “interesting” means, totals, etc.).
The variables
We will use about a dozen different variables in the examples in this workshop. Below is a brief summary of them. Some of the variables have been recoded to be binary variables (values of 2 recoded to a value of 0). The count of missing observations includes values truly missing as well as refused and don’t know.
ridageyr – Age in years at exam – recoded; range of values: 0 – 79 are actual values, 80 = 80+ years of age
pad630 – How much time do you spend doing moderate-intensity activities on a typical work day?; range of values: 10-960 (minutes), 7053 missing observations
hsq496 – During the past 30 days, for about how many days have you felt worried, tense or anxious?; range of values: 0-30; 3073 missing observations
female – Recode of the variable riagendr; 0 = male, 1 = female; no missing observations
dmdborn4 – Country of birth; 1 = born in the United States, 0 = otherwise; 5 missing observations
dmdmartl – Marital status; 1 = married, 2 = widowed, 3 = divorced, 4 = separated, 5 = never married, 6 = living with partner; 4203 missing observations
dmdeduc2 – Education level of adults aged 20+ years; 1 = less than 9th grade, 2 = 9-11th grade, 3 = high school graduate, GED or equivalent, 4 = some college or AA degree, 5 = college graduate or above; 4201 missing observations
pad675 – How much time do you spend doing moderate-intensity sports, fitness, or recreation activities on a typical day?; range of values: 10-600 (minutes); 6220 missing observations
hsq571 – During the past 12 months, have you donated blood?; 0 = no, 1 = yes; 3673 missing observations
pad680 – How much time do you usually spend sitting on a typical day?; range of values: 0-1380 (minutes); 2365 missing observations
paq665 – Do you do any moderate-intensity sports, fitness or recreational activities that cause a small increase in breathing or heart rate at least 10 minutes continually?; 0 = no, 1 = yes; 2329 missing observations
hsd010 – Would you say that your general health is…; 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor; 3064 missing observations
hsq470 – number of days in the last 30 days that physical health is not good; range of values: 0 – 30 (days), 3075 missing observations
hsq480 – number of days in the last 30 days that mental health is not good; range of values: 0 – 30 (days), 3073 missing observations
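The recoding described above (turning 1/2 variables into 0/1 indicators and setting refused and don't know to missing) might look like the following sketch. The raw codes used here (riagendr: 1 = male, 2 = female; hsq571: 1 = yes, 2 = no, 7 = refused, 9 = don't know) are assumptions that should be checked against the NHANES codebook, and the tiny input data set is invented so that the example runs on its own.

```sas
* invented toy data standing in for the raw NHANES variables;
data raw_demo;
  input riagendr hsq571;
  datalines;
1 1
2 2
2 7
1 9
;
run;

data recode_demo;
  set raw_demo;
  * assumed raw coding: 1 = male, 2 = female;
  if riagendr = 1 then female = 0;
  else if riagendr = 2 then female = 1;
  * assumed raw coding: 1 = yes, 2 = no, 7 = refused, 9 = do not know;
  if hsq571 in (7, 9) then hsq571 = .;
  else if hsq571 = 2 then hsq571 = 0;
run;

* the missing option shows how many values were set to missing;
proc freq data = recode_demo;
  tables female hsq571 / missing;
run;
```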
There are three other variables that we should identify. One is the sampling weight variable. It is wtint2yr. The cluster variable is sdmvpsu and the stratification variable is sdmvstra. There are 14 strata and 31 clusters in this dataset. Let’s briefly look at each of these variables.
proc means data = nhanes2012 n min mean max sum;
var wtint2yr;
run;
The MEANS Procedure
Analysis Variable : WTINT2YR
N Minimum Mean Maximum Sum
--------------------------------------------------------------------
9756 3320.89 31425.86 220233.32 306590681
--------------------------------------------------------------------
We see that there are 9756 observations in the dataset. The average weight is 31425.86, with a minimum of 3320.89 and a maximum of 220233.32. What does this mean? Each row of data in this dataset has a value for the sampling weight. The person who contributed that row of data represents that many people in the population. What is “the population?” Quoting from the NHANES documentation (NHANES 2011-2012 Overview): “The NHANES target population is the noninstitutionalized civilian resident population of the United States.” The sum of the weights, 306,590,681, is the estimated number of people in that population. However, if you look at other sources for the population of the United States in 2012, you will see something like 314.1 million; those figures count everyone, including people who are institutionalized or on active military duty, who are outside the NHANES target population.
Now let’s look at the cluster and strata variables.
proc freq data = nhanes2012;
tables sdmvpsu sdmvstra;
run;
The FREQ Procedure
Cumulative Cumulative
SDMVPSU Frequency Percent Frequency Percent
------------------------------------------------------------
1 4374 44.83 4374 44.83
2 4490 46.02 8864 90.86
3 892 9.14 9756 100.00
Cumulative Cumulative
SDMVSTRA Frequency Percent Frequency Percent
-------------------------------------------------------------
90 862 8.84 862 8.84
91 998 10.23 1860 19.07
92 875 8.97 2735 28.03
93 602 6.17 3337 34.20
94 688 7.05 4025 41.26
95 722 7.40 4747 48.66
96 676 6.93 5423 55.59
97 608 6.23 6031 61.82
98 708 7.26 6739 69.08
99 682 6.99 7421 76.07
100 700 7.18 8121 83.24
101 715 7.33 8836 90.57
102 624 6.40 9460 96.97
103 296 3.03 9756 100.00
proc freq data = nhanes2012;
tables sdmvpsu*sdmvstra;
run;
The FREQ Procedure
Table of SDMVPSU by SDMVSTRA
SDMVPSU SDMVSTRA
Frequency|
Percent |
Row Pct |
Col Pct | 90| 91| 92| 93| 94| 95| 96| Total
---------+--------+--------+--------+--------+--------+--------+--------+
1 | 278 | 309 | 328 | 276 | 322 | 348 | 336 | 4374
| 2.85 | 3.17 | 3.36 | 2.83 | 3.30 | 3.57 | 3.44 | 44.83
| 6.36 | 7.06 | 7.50 | 6.31 | 7.36 | 7.96 | 7.68 |
| 32.25 | 30.96 | 37.49 | 45.85 | 46.80 | 48.20 | 49.70 |
---------+--------+--------+--------+--------+--------+--------+--------+
2 | 351 | 333 | 244 | 326 | 366 | 374 | 340 | 4490
| 3.60 | 3.41 | 2.50 | 3.34 | 3.75 | 3.83 | 3.49 | 46.02
| 7.82 | 7.42 | 5.43 | 7.26 | 8.15 | 8.33 | 7.57 |
| 40.72 | 33.37 | 27.89 | 54.15 | 53.20 | 51.80 | 50.30 |
---------+--------+--------+--------+--------+--------+--------+--------+
3 | 233 | 356 | 303 | 0 | 0 | 0 | 0 | 892
| 2.39 | 3.65 | 3.11 | 0.00 | 0.00 | 0.00 | 0.00 | 9.14
| 26.12 | 39.91 | 33.97 | 0.00 | 0.00 | 0.00 | 0.00 |
| 27.03 | 35.67 | 34.63 | 0.00 | 0.00 | 0.00 | 0.00 |
---------+--------+--------+--------+--------+--------+--------+--------+
Total 862 998 875 602 688 722 676 9756
8.84 10.23 8.97 6.17 7.05 7.40 6.93 100.00
(Continued)
Frequency|
Percent |
Row Pct |
Col Pct | 97| 98| 99| 100| 101| 102| 103| Total
---------+--------+--------+--------+--------+--------+--------+--------+
1 | 316 | 388 | 362 | 343 | 358 | 270 | 140 | 4374
| 3.24 | 3.98 | 3.71 | 3.52 | 3.67 | 2.77 | 1.44 | 44.83
| 7.22 | 8.87 | 8.28 | 7.84 | 8.18 | 6.17 | 3.20 |
| 51.97 | 54.80 | 53.08 | 49.00 | 50.07 | 43.27 | 47.30 |
---------+--------+--------+--------+--------+--------+--------+--------+
2 | 292 | 320 | 320 | 357 | 357 | 354 | 156 | 4490
| 2.99 | 3.28 | 3.28 | 3.66 | 3.66 | 3.63 | 1.60 | 46.02
| 6.50 | 7.13 | 7.13 | 7.95 | 7.95 | 7.88 | 3.47 |
| 48.03 | 45.20 | 46.92 | 51.00 | 49.93 | 56.73 | 52.70 |
---------+--------+--------+--------+--------+--------+--------+--------+
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 892
| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 9.14
| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
---------+--------+--------+--------+--------+--------+--------+--------+
Total 608 708 682 700 715 624 296 9756
6.23 7.26 6.99 7.18 7.33 6.40 3.03 100.00
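This tells us that there are two or three clusters (AKA PSUs) per stratum: strata 90, 91 and 92 each have three, and the other 11 strata each have two, for a total of 31 clusters. Having only a few PSUs per stratum is pretty typical for a survey dataset. The numbering of the clusters and strata does not matter in most statistical software packages.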
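The number of PSUs in each stratum can also be counted directly instead of being read off the crosstabulation. The query below is one possible sketch using proc sql; the column name n_psu is hypothetical. Its result should agree with the nonzero cells in the table above.

```sas
* count the distinct PSUs within each stratum;
proc sql;
  select sdmvstra,
         count(distinct sdmvpsu) as n_psu
  from nhanes2012
  group by sdmvstra;
quit;
```

A stratum with only one PSU would be a problem for variance estimation, so this is a useful check to run on any new survey data set.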
Descriptive statistics
We will start by calculating some descriptive statistics of some of the continuous variables. We will use proc surveymeans to get some basic information regarding the continuous variable ridageyr.
* descriptives with a continuous variable;
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
RIDAGEYR 9756 37.185195 0.696477 35.7157572 38.6546320
---------------------------------------------------------------------------------
We see some familiar numbers in this output. We see the 14 strata, 31 clusters, 9756 observations, and the estimated population total of 306,590,681.
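To see what the design elements contribute, the same mean can be requested from proc means using only the weight. The sketch below is illustrative and no output is shown. Both procedures compute the same weighted mean, but proc means knows nothing about the strata and clusters, so its standard error is not a design-based one and should not be reported.

```sas
* weights only: strata and clusters are ignored;
proc means data = nhanes2012 mean stderr;
  weight wtint2yr;
  var ridageyr;
run;

* full design: weights, strata, and clusters;
proc surveymeans data = nhanes2012 mean stderr;
  weight wtint2yr;
  cluster sdmvpsu;
  strata sdmvstra;
  var ridageyr;
run;
```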
There are many options that you can use. The options are usually included on the proc statement. The range option gives the range, which is the maximum minus the minimum.
* with some options;
proc surveymeans data = nhanes2012 min mean max range;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Statistics
Std Error
Variable Minimum Maximum Range Mean of Mean
----------------------------------------------------------------------------------------
RIDAGEYR 0 80.000000 80.000000 37.185195 0.696477
----------------------------------------------------------------------------------------
Notice that the output includes only the statistics requested on the proc surveymeans statement.
proc surveymeans data = nhanes2012 quartiles;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Quantiles
Variable Percentile Estimate Std Error 95% Confidence Limits
----------------------------------------------------------------------------------
RIDAGEYR 25% Q1 17.514078 0.580720 16.2888667 18.7392897
50% Median 36.205625 1.394588 33.2633021 39.1479485
75% Q3 54.651947 0.925135 52.7000816 56.6038114
----------------------------------------------------------------------------------
* other options include deciles, quartiles, median, q1, q3, and specific values;
proc surveymeans data = nhanes2012 percentile = (10 25 50 75 90);
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Quantiles
Variable Percentile Estimate Std Error 95% Confidence Limits
----------------------------------------------------------------------------------
RIDAGEYR 10% D1 6.557412 0.318603 5.8852178 7.2296057
25% Q1 17.514078 0.580720 16.2888667 18.7392897
50% Median 36.205625 1.394588 33.2633021 39.1479485
75% Q3 54.651947 0.925135 52.7000816 56.6038114
90% D9 67.552258 1.029785 65.3796014 69.7249153
----------------------------------------------------------------------------------
In the example below, five options are specified. The nmiss option shows the number of missing values for the variable pad630 (How much time do you spend doing moderate-intensity activities on a typical work day?). The df option shows the degrees of freedom used. The degrees of freedom are equal to the number of clusters (PSUs) minus the number of strata. In this example, 31 – 14 = 17. The cv option gives the coefficient of variation of the estimate, which is the standard error of the mean divided by the mean. The geomean option gives the geometric mean, which is the nth root of the product of n numbers. It is sometimes used when combining items that have different ranges. The gmstderr option gives the standard error of the geometric mean.
* using some options;
proc surveymeans data = nhanes2012 nmiss df cv geomean gmstderr;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var pad630;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Statistics
Coeff of
Variable N Miss DF Variation
--------------------------------------------------
PAD630 7702 17 0.039883
--------------------------------------------------
Geometric Means
Geometric
Variable Mean Std Error
----------------------------------------
PAD630 90.048920 3.648271
----------------------------------------
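The degrees of freedom and the coefficient of variation in this output can be checked by hand. The worked sketch below takes the mean and standard error of pad630 from the proc surveymeans output in the next example. The geometric mean is written in its unweighted textbook form only; how the sampling weights enter the survey estimate is not shown.

```latex
\begin{align*}
\mathrm{df} &= n_{\text{clusters}} - n_{\text{strata}} = 31 - 14 = 17 \\[4pt]
\mathrm{CV} &= \frac{\mathrm{SE}(\bar{y})}{\bar{y}}
             = \frac{5.579060}{139.887377} \approx 0.0399 \\[4pt]
\mathrm{GM} &= \Bigl(\prod_{i=1}^{n} y_i\Bigr)^{1/n}
             = \exp\Bigl(\frac{1}{n}\sum_{i=1}^{n} \ln y_i\Bigr),
             \qquad y_i > 0
\end{align*}
```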
Notice that SAS does not do a listwise deletion of missing values across all of the variables listed on the var statement. (Notice that the N is different for each of the three variables listed in the output.)
* does not do a listwise deletion across multiple variables;
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr pad630 hsq496;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
RIDAGEYR 9756 37.185195 0.696477 35.715757 38.654632
PAD630 2054 139.887377 5.579060 128.116590 151.658164
HSQ496 5883 5.383908 0.189951 4.983147 5.784669
---------------------------------------------------------------------------------
The ODS graphics must be turned on for SAS to produce the graphs. If you do not submit the ods graphics on; statement, SAS will give you all of the output from proc surveymeans except for the graphs. There will be a warning in the log file indicating that ODS graphics must be turned on in order to get the graphs.
* getting some graphs;
ods graphics on;
proc surveymeans data = nhanes2012 plots = all;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;
ods graphics off;
<output omitted>
In the box plot, the diamond is the mean (37.19), and the line is the median (36.21).
* getting one graph at a time;
ods graphics on;
proc surveymeans data = nhanes2012 plots = boxplot;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;
ods graphics off;
* getting just the histogram;
ods graphics on;
proc surveymeans data = nhanes2012 plots = histogram;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;
ods graphics off;
There are a few different ways to get descriptive statistics with categorical variables. You can use proc surveymeans if your variable is binary (i.e., coded 0/1).
* descriptives with a binary variable;
* this is actually a proportion;
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var female;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
female 9756 0.511952 0.006440 0.49836568 0.52553917
---------------------------------------------------------------------------------
Probably the most common procedure for getting descriptive statistics for categorical variables is proc surveyfreq. The tables statement in proc surveyfreq works the same way that the tables statement in proc freq works.
* this might be more common;
proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables female;
run;
The SURVEYFREQ Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Table of female
Weighted Std Dev of Std Err of
female Frequency Frequency Wgt Freq Percent Percent
-------------------------------------------------------------------------
0 4856 149630839 8783128 48.8048 0.6440
1 4900 156959842 11234711 51.1952 0.6440
Total 9756 306590681 19723273 100.000
-------------------------------------------------------------------------
You may want to use formats to help label the output.
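The formats must exist before a format statement can refer to them. The definitions of fm. and cb. are not shown in this section, so the proc format step below is a guess reconstructed from the value labels that appear in the output and from the coding given in the variable list.

```sas
* assumed definitions, inferred from the labels in the output;
proc format;
  value fm 0 = "male"
           1 = "female";
  value cb 0 = "born elsewhere"
           1 = "born in US";
run;
```

Formats defined this way are stored in the WORK library and last only for the current SAS session unless a permanent format library is specified.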
* using formats;
proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables female*dmdborn4;
format female fm. dmdborn4 cb.;
run;
The SURVEYFREQ Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Table of female by DMDBORN4
Weighted Std Dev of Std Err of
female DMDBORN4 Frequency Frequency Wgt Freq Percent Percent
-------------------------------------------------------------------------------------------
male born elsewhere 1027 22449131 2368069 7.3247 0.8655
born in US 3825 127102676 8587374 41.4710 0.6134
Total 4852 149551807 8773197 48.7957 0.6428
-------------------------------------------------------------------------------------------
female born elsewhere 1056 23543299 1926175 7.6817 0.7451
born in US 3843 133390830 11273589 43.5227 1.2458
Total 4899 156934129 11235137 51.2043 0.6428
-------------------------------------------------------------------------------------------
Total born elsewhere 2083 45992430 4177655 15.0064 1.5756
born in US 7668 260493506 19670647 84.9936 1.5756
Total 9751 306485936 19715992 100.000
-------------------------------------------------------------------------------------------
Frequency Missing = 5
In the next example, several options are used. The expected option gives the expected weighted frequencies for each cell in the table. The row option gives the row percentages. The col option gives the column percentages. The chisq option gives the Rao-Scott chi-square test; the lrchisq option gives the Rao-Scott likelihood ratio chi-square test; the wchisq option gives the Wald chi-square test; the wllchisq option gives the Wald log-linear chi-square test.
proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables female*dmdborn4 / expected row col chisq lrchisq wchisq wllchisq;
format female fm. dmdborn4 cb.;
run;
The SURVEYFREQ Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Table of female by DMDBORN4
Weighted Std Dev of Expected Std Err of
female DMDBORN4 Frequency Frequency Wgt Freq Wgt Freq Percent Percent
--------------------------------------------------------------------------------------------------
male born elsewhere 1027 22449131 2368069 22442305 7.3247 0.8655
born in US 3825 127102676 8587374 127109501 41.4710 0.6134
Total 4852 149551807 8773197 48.7957 0.6428
--------------------------------------------------------------------------------------------------
female born elsewhere 1056 23543299 1926175 23550124 7.6817 0.7451
born in US 3843 133390830 11273589 133384005 43.5227 1.2458
Total 4899 156934129 11235137 51.2043 0.6428
--------------------------------------------------------------------------------------------------
Total born elsewhere 2083 45992430 4177655 15.0064 1.5756
born in US 7668 260493506 19670647 84.9936 1.5756
Total 9751 306485936 19715992 100.000
--------------------------------------------------------------------------------------------------
Frequency Missing = 5
Table of female by DMDBORN4
Row Std Err of Column Std Err of
female DMDBORN4 Percent Row Percent Percent Col Percent
--------------------------------------------------------------------------
male born elsewhere 15.0109 1.6401 48.8105 1.2315
born in US 84.9891 1.6401 48.7930 0.6756
Total 100.000
--------------------------------------------------------------------------
female born elsewhere 15.0020 1.5769 51.1895 1.2315
born in US 84.9980 1.5769 51.2070 0.6756
Total 100.000
--------------------------------------------------------------------------
Total born elsewhere 100.000
born in US 100.000
Total
--------------------------------------------------------------------------
Frequency Missing = 5
Table of female by DMDBORN4
Rao-Scott Chi-Square Test
Pearson Chi-Square 0.0002
Design Correction 0.7893
Rao-Scott Chi-Square 0.0002
DF 1
Pr > ChiSq 0.9889
F Value 0.0002
Num DF 1
Den DF 17
Pr > F 0.9891
Sample Size = 9751
Rao-Scott Likelihood Ratio Test
Likelihood Ratio Chi-Square 0.0002
Design Correction 0.7893
Rao-Scott Chi-Square 0.0002
DF 1
Pr > ChiSq 0.9889
F Value 0.0002
Num DF 1
Den DF 17
Pr > F 0.9891
Sample Size = 9751
Wald Chi-Square Test
Chi-Square 0.0002
F Value 0.0002
Num DF 1
Den DF 17
Pr > F 0.9891
Sample Size = 9751
Table of female by DMDBORN4
Wald Log-Linear Chi-Square Test
Chi-Square 0.0002
F Value 0.0002
Num DF 1
Den DF 17
Pr > F 0.9891
Sample Size = 9751
In the next example, we will use three options. The cv option displays coefficients of variation for the percentages. A coefficient of variation is usually defined as the standard deviation divided by the mean; in our case, it is the standard error divided by the point estimate. For example, for the percentage of males born elsewhere: .8655/7.3247 = .1182. The cvwt option displays coefficients of variation for the weighted frequencies. For example, for males born elsewhere: 2368069/22449131 = .1055. The deff option displays design effects for the percentages. The design effect attempts to quantify the extent to which the observed sampling variance differs from what would be expected if simple random sampling (SRS) had been used. It is defined as variance(observed) / variance(SRS), and it can be thought of as a measure of efficiency. If the design effect is 1, then the current analysis with the current sampling plan is as efficient as the same analysis using an SRS. If the design effect is less than 1, then the current analysis with the current sample is more efficient than the same analysis with an SRS. If the design effect is greater than 1, then the current analysis with the current sample is less efficient than the same analysis with an SRS. In general, clustering increases the design effect.
This is related to the idea of an “effective sample size”, which is the number of observations divided by the design effect. For example, for males born elsewhere: 1027/10.7602 = 95.44; for the total born elsewhere: 2083/18.9775 = 109.76.
proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables female*dmdborn4 / cv cvwt deff;
format female fm. dmdborn4 cb.;
run;
The SURVEYFREQ Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Table of female by DMDBORN4
Weighted Std Dev of CV for Std Err of
female DMDBORN4 Frequency Frequency Wgt Freq Wgt Freq Percent Percent
-------------------------------------------------------------------------------------------------
male born elsewhere 1027 22449131 2368069 0.1055 7.3247 0.8655
born in US 3825 127102676 8587374 0.0676 41.4710 0.6134
Total 4852 149551807 8773197 0.0587 48.7957 0.6428
-------------------------------------------------------------------------------------------------
female born elsewhere 1056 23543299 1926175 0.0818 7.6817 0.7451
born in US 3843 133390830 11273589 0.0845 43.5227 1.2458
Total 4899 156934129 11235137 0.0716 51.2043 0.6428
-------------------------------------------------------------------------------------------------
Total born elsewhere 2083 45992430 4177655 0.0908 15.0064 1.5756
born in US 7668 260493506 19670647 0.0755 84.9936 1.5756
Total 9751 306485936 19715992 0.0643 100.000
-------------------------------------------------------------------------------------------------
Frequency Missing = 5
Table of female by DMDBORN4
CV for Design
female DMDBORN4 Percent Effect
-----------------------------------------------
male born elsewhere 0.1182 10.7602
born in US 0.0148 1.5114
Total 0.0132 1.6126
-----------------------------------------------
female born elsewhere 0.0970 7.6319
born in US 0.0286 6.1566
Total 0.0126 1.6126
-----------------------------------------------
Total born elsewhere 0.1050 18.9775
born in US 0.0185 18.9775
Total
-----------------------------------------------
Frequency Missing = 5
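The worked figures quoted for the cv, cvwt and deff options can be checked with a short data step. This is only a sketch: every number is copied by hand from the output above, the variable names are made up for illustration, and the design-effect check assumes the SRS variance of a proportion is p(1-p)/n with no finite population correction, which may differ slightly from the formula SAS uses internally.

```sas
data _null_;
/* males born elsewhere: values copied from the output above */
n_cell = 1027;       /* unweighted frequency               */
wgt_freq = 22449131; /* weighted frequency                 */
sd_wgt = 2368069;    /* std dev of weighted frequency      */
pct = 7.3247;        /* weighted percent                   */
se_pct = 0.8655;     /* std err of percent                 */
deff = 10.7602;      /* design effect reported by SAS      */
n_table = 9751;      /* nonmissing observations in table   */

/* coefficients of variation: standard error / point estimate */
cv_pct = se_pct / pct;
cv_wgt = sd_wgt / wgt_freq;

/* effective sample size: observations / design effect */
n_eff = n_cell / deff;

/* approximate design effect, ASSUMING var(SRS) = p(1-p)/n */
p = pct / 100;
var_srs = p * (1 - p) / n_table;
deff_approx = (se_pct / 100)**2 / var_srs;

put cv_pct= 8.4 cv_wgt= 8.4 n_eff= 8.2 deff_approx= 8.2;
run;
```

The values written to the log should agree with the reported .1182, .1055 and 95.44, and the approximate design effect should land close to the reported 10.7602; small differences are expected because the inputs are rounded.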
Let’s look at some graphs. Using the format statement puts the value labels on the x-axis of the graph.
ods graphics on;
proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables dmdmartl / plots = wtfreqplot;
format dmdmartl matsat.;
run;
ods graphics off;
The SURVEYFREQ Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Table of DMDMARTL
Weighted Std Dev of Std Err of
DMDMARTL Frequency Frequency Wgt Freq Percent Percent
--------------------------------------------------------------------------------------
married 2683 118822198 10556102 53.0792 2.0495
widowed 467 12586462 1087437 5.6225 0.3200
divorced 571 23926362 2606988 10.6882 0.7248
separated 204 5366932 614868 2.3975 0.3161
never married 1188 44479637 4687152 19.8695 2.3629
living with partner 440 18676762 1874572 8.3431 0.6473
Total 5553 223858353 14237911 100.000
--------------------------------------------------------------------------------------
Frequency Missing = 4203
* choose more interesting variables;
ods graphics on;
proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables female*dmdborn4 / plots = mosaicplot;
format female fm. dmdborn4 cb.;
run;
ods graphics off;
The SURVEYFREQ Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Table of female by DMDBORN4
Weighted Std Dev of Std Err of
female DMDBORN4 Frequency Frequency Wgt Freq Percent Percent
-------------------------------------------------------------------------------------------
male born elsewhere 1027 22449131 2368069 7.3247 0.8655
born in US 3825 127102676 8587374 41.4710 0.6134
Total 4852 149551807 8773197 48.7957 0.6428
-------------------------------------------------------------------------------------------
female born elsewhere 1056 23543299 1926175 7.6817 0.7451
born in US 3843 133390830 11273589 43.5227 1.2458
Total 4899 156934129 11235137 51.2043 0.6428
-------------------------------------------------------------------------------------------
Total born elsewhere 2083 45992430 4177655 15.0064 1.5756
born in US 7668 260493506 19670647 84.9936 1.5756
Total 9751 306485936 19715992 100.000
-------------------------------------------------------------------------------------------
Frequency Missing = 5
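If you are requesting more than one plot, you need to enclose the plot requests in parentheses. The or option will give the odds ratios and relative risks, and the risk option will give the risks and risk differences. Because three variables are listed on the tables statement, a separate table of female by dmdborn4 is produced for each level of dmdmartl.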
ods graphics on;
proc surveyfreq data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
tables dmdmartl*female*dmdborn4 / risk or plots = (oddsratioplot relriskplot);
format dmdmartl matsat. female fm. dmdborn4 cb.;
run;
ods graphics off;
The SURVEYFREQ Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Table of female by DMDBORN4
Controlling for DMDMARTL=married
Weighted Std Dev of Std Err of
female DMDBORN4 Frequency Frequency Wgt Freq Percent Percent
-------------------------------------------------------------------------------------------
male born elsewhere 554 11803069 1179456 9.9367 1.1615
born in US 874 46849089 4983912 39.4410 1.0838
Total 1428 58652159 5189428 49.3777 0.5675
-------------------------------------------------------------------------------------------
female born elsewhere 476 11167772 1125718 9.4019 1.0815
born in US 777 48962647 5297262 41.2204 1.3449
Total 1253 60130420 5451479 50.6223 0.5675
-------------------------------------------------------------------------------------------
Total born elsewhere 1030 22970842 2253478 19.3386 2.2056
born in US 1651 95811737 10207366 80.6614 2.2056
Total 2681 118782579 10555944 100.000
-------------------------------------------------------------------------------------------
Column 1 Risk Estimates
Standard
Risk Error 95% Confidence Limits
Row 1 0.2012 0.0228 0.1532 0.2492
Row 2 0.1857 0.0220 0.1393 0.2321
Total 0.1934 0.0221 0.1469 0.2399
Difference 0.0155 0.0078 -0.0010 0.0320
Difference is (Row 1 - Row 2)
Sample Size = 5549
Column 2 Risk Estimates
Standard
Risk Error 95% Confidence Limits
Row 1 0.7988 0.0228 0.7508 0.8468
Row 2 0.8143 0.0220 0.7679 0.8607
Total 0.8066 0.0221 0.7601 0.8531
Difference -0.0155 0.0078 -0.0320 0.0010
Difference is (Row 1 - Row 2)
Sample Size = 5549
Odds Ratio and Relative Risks (Row1/Row2)
Estimate 95% Confidence Limits
Odds Ratio 1.1046 0.9937 1.2278
Column 1 Relative Risk 1.0835 0.9944 1.1806
Column 2 Relative Risk 0.9809 0.9609 1.0014
Sample Size = 5549
Table of female by DMDBORN4
Controlling for DMDMARTL=widowed
Weighted Std Dev of Std Err of
female DMDBORN4 Frequency Frequency Wgt Freq Percent Percent
-------------------------------------------------------------------------------------------
male born elsewhere 23 299145 63828 2.3767 0.5554
born in US 99 2433302 326571 19.3327 2.1606
Total 122 2732447 337748 21.7094 2.3135
-------------------------------------------------------------------------------------------
female born elsewhere 82 1350041 202386 10.7261 1.8146
born in US 263 8503975 963871 67.5645 3.0838
Total 345 9854015 951308 78.2906 2.3135
-------------------------------------------------------------------------------------------
Total born elsewhere 105 1649186 210849 13.1029 1.9883
born in US 362 10937276 1100321 86.8971 1.9883
Total 467 12586462 1087437 100.000
-------------------------------------------------------------------------------------------
Column 1 Risk Estimates
Standard
Risk Error 95% Confidence Limits
Row 1 0.1095 0.0237 0.0595 0.1594
Row 2 0.1370 0.0239 0.0865 0.1875
Total 0.1310 0.0199 0.0891 0.1730
Difference -0.0275 0.0316 -0.0941 0.0391
Difference is (Row 1 - Row 2)
Sample Size = 5549
Column 2 Risk Estimates
Standard
Risk Error 95% Confidence Limits
Row 1 0.8905 0.0237 0.8406 0.9405
Row 2 0.8630 0.0239 0.8125 0.9135
Total 0.8690 0.0199 0.8270 0.9109
Difference 0.0275 0.0316 -0.0391 0.0941
Difference is (Row 1 - Row 2)
Sample Size = 5549
Odds Ratio and Relative Risks (Row1/Row2)
Estimate 95% Confidence Limits
Odds Ratio 0.7744 0.4141 1.4482
Column 1 Relative Risk 0.7991 0.4607 1.3860
Column 2 Relative Risk 1.0319 0.9564 1.1134
Sample Size = 5549
Table of female by DMDBORN4
Controlling for DMDMARTL=divorced
Weighted Std Dev of Std Err of
female DMDBORN4 Frequency Frequency Wgt Freq Percent Percent
-------------------------------------------------------------------------------------------
male born elsewhere 32 646692 190158 2.7028 0.8131
born in US 205 8640920 1381509 36.1146 3.6127
Total 237 9287612 1406845 38.8175 3.6318
-------------------------------------------------------------------------------------------
female born elsewhere 70 1676180 267604 7.0056 1.2917
born in US 264 12962570 1692897 54.1769 3.5153
Total 334 14638749 1727388 61.1825 3.6318
-------------------------------------------------------------------------------------------
Total born elsewhere 102 2322871 384346 9.7084 1.8069
born in US 469 21603490 2586103 90.2916 1.8069
Total 571 23926362 2606988 100.000
-------------------------------------------------------------------------------------------
Column 1 Risk Estimates
Standard
Risk Error 95% Confidence Limits
Row 1 0.0696 0.0211 0.0252 0.1141
Row 2 0.1145 0.0204 0.0715 0.1575
Total 0.0971 0.0181 0.0590 0.1352
Difference -0.0449 0.0210 -0.0893 -0.0005
Difference is (Row 1 - Row 2)
Sample Size = 5549
Column 2 Risk Estimates
Standard
Risk Error 95% Confidence Limits
Row 1 0.9304 0.0211 0.8859 0.9748
Row 2 0.8855 0.0204 0.8425 0.9285
Total 0.9029 0.0181 0.8648 0.9410
Difference 0.0449 0.0210 0.0005 0.0893
Difference is (Row 1 - Row 2)
Sample Size = 5549
Odds Ratio and Relative Risks (Row1/Row2)
Estimate 95% Confidence Limits
Odds Ratio 0.5788 0.3154 1.0619
Column 1 Relative Risk 0.6081 0.3466 1.0669
Column 2 Relative Risk 1.0507 1.0006 1.1033
Sample Size = 5549
Table of female by DMDBORN4
Controlling for DMDMARTL=separated
Weighted Std Dev of Std Err of
female DMDBORN4 Frequency Frequency Wgt Freq Percent Percent
-------------------------------------------------------------------------------------------
male born elsewhere 36 798720 216266 14.8822 3.5300
born in US 43 1220641 289387 22.7437 4.3090
Total 79 2019361 376397 37.6260 4.7175
-------------------------------------------------------------------------------------------
female born elsewhere 54 1387134 213988 25.8459 3.9489
born in US 71 1960437 364269 36.5281 4.9696
Total 125 3347571 413911 62.3740 4.7175
-------------------------------------------------------------------------------------------
Total born elsewhere 90 2185854 249474 40.7282 3.2853
born in US 114 3181078 458084 59.2718 3.2853
Total 204 5366932 614868 100.000
-------------------------------------------------------------------------------------------
Column 1 Risk Estimates
Standard
Risk Error 95% Confidence Limits
Row 1 0.3955 0.0822 0.2222 0.5689
Row 2 0.4144 0.0599 0.2880 0.5408
Total 0.4073 0.0329 0.3380 0.4766
Difference -0.0188 0.1254 -0.2835 0.2458
Difference is (Row 1 - Row 2)
Sample Size = 5549
Column 2 Risk Estimates
Standard
Risk Error 95% Confidence Limits
Row 1 0.6045 0.0822 0.4311 0.7778
Row 2 0.5856 0.0599 0.4592 0.7120
Total 0.5927 0.0329 0.5234 0.6620
Difference 0.0188 0.1254 -0.2458 0.2835
Difference is (Row 1 - Row 2)
Sample Size = 5549
Odds Ratio and Relative Risks (Row1/Row2)
Estimate 95% Confidence Limits
Odds Ratio 0.9248 0.3077 2.7793
Column 1 Relative Risk 0.9545 0.4948 1.8413
Column 2 Relative Risk 1.0322 0.6625 1.6082
Sample Size = 5549
Table of female by DMDBORN4
Controlling for DMDMARTL=never married
Weighted Std Dev of Std Err of
female DMDBORN4 Frequency Frequency Wgt Freq Percent Percent
-------------------------------------------------------------------------------------------
male born elsewhere 150 3945129 570145 8.8807 1.1797
born in US 482 21151252 2485409 47.6125 1.9035
Total 632 25096380 2750788 56.4931 1.9690
-------------------------------------------------------------------------------------------
female born elsewhere 133 3316186 623580 7.4649 1.1880
born in US 421 16011195 2027247 36.0420 2.3406
Total 554 19327382 2250319 43.5069 1.9690
-------------------------------------------------------------------------------------------
Total born elsewhere 283 7261315 1072947 16.3456 2.0345
born in US 903 37162447 4176512 83.6544 2.0345
Total 1186 44423762 4681954 100.000
-------------------------------------------------------------------------------------------
Column 1 Risk Estimates
Standard
Risk Error 95% Confidence Limits
Row 1 0.1572 0.0196 0.1158 0.1986
Row 2 0.1716 0.0287 0.1110 0.2321
Total 0.1635 0.0203 0.1205 0.2064
Difference -0.0144 0.0254 -0.0680 0.0393
Difference is (Row 1 - Row 2)
Sample Size = 5549
Column 2 Risk Estimates
Standard
Risk Error 95% Confidence Limits
Row 1 0.8428 0.0196 0.8014 0.8842
Row 2 0.8284 0.0287 0.7679 0.8890
Total 0.8365 0.0203 0.7936 0.8795
Difference 0.0144 0.0254 -0.0393 0.0680
Difference is (Row 1 - Row 2)
Sample Size = 5549
Odds Ratio and Relative Risks (Row1/Row2)
Estimate 95% Confidence Limits
Odds Ratio 0.9006 0.6143 1.3202
Column 1 Relative Risk 0.9162 0.6665 1.2593
Column 2 Relative Risk 1.0174 0.9537 1.0853
Sample Size = 5549
Table of female by DMDBORN4
Controlling for DMDMARTL=living with partner
Weighted Std Dev of Std Err of
female DMDBORN4 Frequency Frequency Wgt Freq Percent Percent
-------------------------------------------------------------------------------------------
male born elsewhere 69 2257394 449540 12.0866 1.9362
born in US 168 7272128 876424 38.9368 2.4468
Total 237 9529522 1065259 51.0234 1.7000
-------------------------------------------------------------------------------------------
female born elsewhere 64 2000636 413700 10.7119 2.0043
born in US 139 7146604 832299 38.2647 2.5917
Total 203 9147240 910699 48.9766 1.7000
-------------------------------------------------------------------------------------------
Total born elsewhere 133 4258030 799318 22.7985 3.5450
born in US 307 14418732 1572309 77.2015 3.5450
Total 440 18676762 1874572 100.000
-------------------------------------------------------------------------------------------
Column 1 Risk Estimates
Standard
Risk Error 95% Confidence Limits
Row 1 0.2369 0.0380 0.1567 0.3170
Row 2 0.2187 0.0414 0.1313 0.3061
Total 0.2280 0.0355 0.1532 0.3028
Difference 0.0182 0.0358 -0.0574 0.0937
Difference is (Row 1 - Row 2)
Sample Size = 5549
Column 2 Risk Estimates
Standard
Risk Error 95% Confidence Limits
Row 1 0.7631 0.0380 0.6830 0.8433
Row 2 0.7813 0.0414 0.6939 0.8687
Total 0.7720 0.0355 0.6972 0.8468
Difference -0.0182 0.0358 -0.0937 0.0574
Difference is (Row 1 - Row 2)
Sample Size = 5549
Odds Ratio and Relative Risks (Row1/Row2)
Estimate 95% Confidence Limits
Odds Ratio 1.1089 0.7191 1.7100
Column 1 Relative Risk 1.0831 0.7741 1.5155
Column 2 Relative Risk 0.9767 0.8859 1.0769
Sample Size = 5549
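The point estimates in the "Odds Ratio and Relative Risks" tables follow directly from the row risks. The data step below is a minimal sketch using the Column 1 risks for the married table (row 1 = male, row 2 = female, column 1 = born elsewhere), copied by hand from the output above; the variable names are illustrative only. It does not reproduce the confidence limits, which depend on the design-based standard errors.

```sas
data _null_;
/* Column 1 risks for DMDMARTL = married, copied from the output */
risk1 = 0.2012;   /* row 1 (male):   proportion born elsewhere */
risk2 = 0.1857;   /* row 2 (female): proportion born elsewhere */

/* risk difference and relative risks, row 1 relative to row 2 */
risk_diff = risk1 - risk2;
rel_risk_col1 = risk1 / risk2;
rel_risk_col2 = (1 - risk1) / (1 - risk2);

/* odds ratio: odds of column 1 in row 1 over odds in row 2 */
odds_ratio = (risk1 / (1 - risk1)) / (risk2 / (1 - risk2));

put risk_diff= 8.4 rel_risk_col1= 8.4 rel_risk_col2= 8.4 odds_ratio= 8.4;
run;
```

Because the risks are rounded to four decimal places, the results should agree with the reported 0.0155, 1.0835, 0.9809 and 1.1046 only to about three decimal places.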
Analysis of subpopulations
Before we continue, we should pause to discuss the analysis of subpopulations. The analysis of subpopulations is one place where survey data and experimental data are quite different. If you have data from an experiment (or quasi-experiment), and you want to analyze the responses from, say, just the women, or just people over age 50, you can simply delete the unwanted cases from the data set or use a by statement. Survey data are different. With survey data, you (almost) never get to delete any cases from the data set, even if you will never use them in any of your analyses. Because of the way the by statement works, you usually don’t use it with survey data either. Instead, SAS has provided a domain statement in most survey procedures that allows you to correctly analyze subpopulations of your survey data. A domain and a subpopulation are the same thing. The domain statement is very similar to a by statement in that you will get output for each level of the variable (or variables) listed on the statement. This means that you will often get more output than you want; you simply ignore the output for domains that are not of interest to you. Please note that there is no domain statement in proc surveyfreq; you are expected to include the variables that you would have put on the domain statement on the tables statement.
First, however, let’s take a second to see why deleting cases from a survey data set can be so problematic. If the data set is subset (meaning that observations not to be included in the subpopulation are deleted from the data set), two problems arise. First, the estimated number of elements in the population cannot be correctly calculated because some values are missing as you sum down the column of sampling weights. Second, the standard errors of the estimates cannot be calculated correctly. When the domain statement is used, only the cases defined by the subpopulation are used in the calculation of the estimate, but all cases are used in the calculation of the standard errors. For more information on this issue, please see Sampling Techniques, Third Edition by William G. Cochran (1977) and Small Area Estimation by J. N. K. Rao (2003).
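As a hedged illustration of the point above, the sketch below places the approach to avoid (subsetting with a where statement) next to the recommended domain statement. It reuses the data set, design variables and format from this seminar; the coding female = 1 for women is an assumption, since the underlying values of the variable are not shown here.

```sas
/* Not recommended: the where statement removes the males before */
/* the design is processed, so the standard error is computed    */
/* from an incomplete picture of the strata and clusters.        */
/* ASSUMPTION: female = 1 identifies women.                      */
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
where female = 1;
var pad630;
run;

/* Recommended: keep every observation and request the           */
/* subpopulation estimates with the domain statement.            */
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain female;
var pad630;
format female fm.;
run;
```

The two runs would be expected to give the same mean for women; any difference would appear in the standard error and confidence limits, and how large it is depends on the data.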
We will begin with an analysis that we have seen before.
* subpops;
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var pad630;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
PAD630 2054 139.887377 5.579060 128.116590 151.658164
---------------------------------------------------------------------------------
Now let’s add the domain statement. The format statement is not technically needed, but it is a nice way to more clearly label the output. If you were interested only in the mean for females, you would ignore the output for males.
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain female;
var pad630;
format female fm.;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
PAD630 2054 139.887377 5.579060 128.116590 151.658164
---------------------------------------------------------------------------------
Domain Analysis: female
Std Error
female Variable N Mean of Mean 95% CL for Mean
-------------------------------------------------------------------------------------------
male PAD630 1136 155.627196 7.008380 140.840807 170.413584
female PAD630 918 121.684284 5.352345 110.391824 132.976744
-------------------------------------------------------------------------------------------
In this example, we include two variables on the domain statement. Notice that this gives the same results as running proc surveymeans twice, once with each variable on the domain statement.
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain female dmdmartl;
var pad630;
format dmdmartl matsat. female fm.;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
PAD630 2054 139.887377 5.579060 128.116590 151.658164
---------------------------------------------------------------------------------
Domain Analysis: female
Std Error
female Variable N Mean of Mean 95% CL for Mean
-------------------------------------------------------------------------------------------
male PAD630 1136 155.627196 7.008380 140.840807 170.413584
female PAD630 918 121.684284 5.352345 110.391824 132.976744
-------------------------------------------------------------------------------------------
Domain Analysis: DMDMARTL
Std Error
DMDMARTL Variable N Mean of Mean 95% CL for Mean
----------------------------------------------------------------------------------------------
married PAD630 841 139.439779 6.869524 124.946351 153.933208
widowed PAD630 96 122.878708 9.530837 102.770399 142.987017
divorced PAD630 166 164.023980 12.070142 138.558206 189.489754
separated PAD630 70 195.067692 37.432563 116.091889 274.043496
never married PAD630 363 140.786819 10.691710 118.229282 163.344355
living with partner PAD630 169 165.483679 10.304480 143.743126 187.224232
----------------------------------------------------------------------------------------------
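The equivalence with two separate runs can be sketched as follows. This simply splits the domain statement used above into two calls with one variable each; judging from the output shown, each run should reproduce the corresponding "Domain Analysis" table.

```sas
/* Run 1: domain analysis by sex only */
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain female;
var pad630;
format female fm.;
run;

/* Run 2: domain analysis by marital status only */
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain dmdmartl;
var pad630;
format dmdmartl matsat.;
run;
```

Listing both variables on one domain statement is mainly a convenience: it saves a second pass through the data and keeps the related tables together.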
In this example, we cross the domain variables, giving us each combination of the two variables.
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain female*dmdmartl;
var pad630;
format dmdmartl matsat. female fm.;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
PAD630 2054 139.887377 5.579060 128.116590 151.658164
---------------------------------------------------------------------------------
Domain Analysis: female*DMDMARTL
Std Error
female DMDMARTL Variable N Mean of Mean
-----------------------------------------------------------------------------------------
male married PAD630 494 156.694056 8.765612
widowed PAD630 23 132.611146 20.775755
divorced PAD630 77 209.422973 27.286194
separated PAD630 32 187.968526 38.861059
never married PAD630 214 146.423066 12.353287
living with partner PAD630 102 201.422030 13.418949
female married PAD630 347 118.625504 9.590405
widowed PAD630 73 120.956913 11.104393
divorced PAD630 89 124.634331 15.505908
separated PAD630 38 200.933304 52.499861
-----------------------------------------------------------------------------------------
Domain Analysis: female*DMDMARTL
female DMDMARTL Variable 95% CL for Mean
------------------------------------------------------------------
male married PAD630 138.200232 175.187881
widowed PAD630 88.778134 176.444159
divorced PAD630 151.854136 266.991811
separated PAD630 105.978858 269.958193
never married PAD630 120.359908 172.486223
living with partner PAD630 173.110522 229.733538
female married PAD630 98.391518 138.859490
widowed PAD630 97.528692 144.385134
divorced PAD630 91.919726 157.348937
separated PAD630 90.168280 311.698328
------------------------------------------------------------------
Domain Analysis: female*DMDMARTL
Std Error
female DMDMARTL Variable N Mean of Mean
-----------------------------------------------------------------------------------------
female never married PAD630 149 131.709083 13.305583
living with partner PAD630 67 122.573635 14.885114
-----------------------------------------------------------------------------------------
Domain Analysis: female*DMDMARTL
female DMDMARTL Variable 95% CL for Mean
------------------------------------------------------------------
female never married PAD630 103.636758 159.781409
living with partner PAD630 91.168789 153.978481
------------------------------------------------------------------
In this example, we cross three variables on the domain statement.
proc surveymeans data = nhanes2012;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
domain female*dmdmartl*dmdeduc2;
var pad630;
format dmdmartl matsat. female fm. dmdeduc2 edu.;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
PAD630 2054 139.887377 5.579060 128.116590 151.658164
---------------------------------------------------------------------------------
Domain Analysis: female*DMDMARTL*DMDEDUC2
female DMDMARTL DMDEDUC2 Variable N Mean
-------------------------------------------------------------------------------------------------
male married less than 9th grade PAD630 44 175.262504
no hs diploma PAD630 68 191.892326
hs grad or GED PAD630 111 165.884895
some college or AA degree PAD630 166 155.183990
-------------------------------------------------------------------------------------------------
Domain Analysis: female*DMDMARTL*DMDEDUC2
Std Error
female DMDMARTL DMDEDUC2 Variable of Mean
----------------------------------------------------------------------------------
male married less than 9th grade PAD630 28.123585
no hs diploma PAD630 18.189105
hs grad or GED PAD630 9.758340
some college or AA degree PAD630 13.216380
----------------------------------------------------------------------------------
Domain Analysis: female*DMDMARTL*DMDEDUC2
female DMDMARTL DMDEDUC2 Variable 95% CL for Mean
-------------------------------------------------------------------------------------------
male married less than 9th grade PAD630 115.926926 234.598081
no hs diploma PAD630 153.516670 230.267982
hs grad or GED PAD630 145.296598 186.473193
some college or AA degree PAD630 127.299865 183.068115
-------------------------------------------------------------------------------------------
The SURVEYMEANS Procedure
Domain Analysis: female*DMDMARTL*DMDEDUC2
female DMDMARTL DMDEDUC2 Variable N Mean
-------------------------------------------------------------------------------------------------
male married college grad or above PAD630 105 133.266995
widowed less than 9th grade PAD630 4 187.343598
no hs diploma PAD630 4 167.204010
hs grad or GED PAD630 2 103.256927
some college or AA degree PAD630 9 125.939470
college grad or above PAD630 4 123.887738
divorced less than 9th grade PAD630 0 .
no hs diploma PAD630 13 238.105548
hs grad or GED PAD630 28 159.620092
some college or AA degree PAD630 28 271.147197
college grad or above PAD630 8 181.383414
separated less than 9th grade PAD630 6 205.133253
no hs diploma PAD630 7 199.085099
hs grad or GED PAD630 9 279.838117
some college or AA degree PAD630 8 81.121846
college grad or above PAD630 2 237.326057
never married less than 9th grade PAD630 8 221.332123
no hs diploma PAD630 24 260.453605
hs grad or GED PAD630 53 169.780030
some college or AA degree PAD630 93 125.941216
college grad or above PAD630 36 104.576598
living with partner less than 9th grade PAD630 9 188.915139
no hs diploma PAD630 21 257.398122
hs grad or GED PAD630 28 215.157719
some college or AA degree PAD630 36 198.181294
college grad or above PAD630 8 105.547477
female married less than 9th grade PAD630 14 123.187262
no hs diploma PAD630 36 143.361616
hs grad or GED PAD630 60 111.808026
some college or AA degree PAD630 121 146.288791
college grad or above PAD630 116 88.919333
widowed less than 9th grade PAD630 5 176.095932
no hs diploma PAD630 17 90.605348
hs grad or GED PAD630 15 110.480512
some college or AA degree PAD630 27 101.838525
college grad or above PAD630 9 260.996206
divorced less than 9th grade PAD630 3 54.288869
no hs diploma PAD630 10 163.202217
hs grad or GED PAD630 23 145.507378
some college or AA degree PAD630 39 129.588332
college grad or above PAD630 14 76.006626
separated less than 9th grade PAD630 4 169.359877
no hs diploma PAD630 10 165.848237
hs grad or GED PAD630 10 302.022494
some college or AA degree PAD630 10 142.532187
-------------------------------------------------------------------------------------------------
The SURVEYMEANS Procedure
Domain Analysis: female*DMDMARTL*DMDEDUC2
Std Error
female DMDMARTL DMDEDUC2 Variable of Mean
----------------------------------------------------------------------------------
male married college grad or above PAD630 10.848793
widowed less than 9th grade PAD630 62.966878
no hs diploma PAD630 14.100541
hs grad or GED PAD630 20.074215
some college or AA degree PAD630 35.754504
college grad or above PAD630 52.934045
divorced less than 9th grade PAD630 .
no hs diploma PAD630 29.710731
hs grad or GED PAD630 17.529990
some college or AA degree PAD630 56.919768
college grad or above PAD630 33.002598
separated less than 9th grade PAD630 35.534816
no hs diploma PAD630 38.033471
hs grad or GED PAD630 48.201023
some college or AA degree PAD630 29.502335
college grad or above PAD630 84.810682
never married less than 9th grade PAD630 60.764206
no hs diploma PAD630 41.069041
hs grad or GED PAD630 18.531657
some college or AA degree PAD630 11.381655
college grad or above PAD630 26.159027
living with partner less than 9th grade PAD630 71.240615
no hs diploma PAD630 43.612166
hs grad or GED PAD630 24.033974
some college or AA degree PAD630 25.916937
college grad or above PAD630 7.580480
female married less than 9th grade PAD630 33.994125
no hs diploma PAD630 14.005356
hs grad or GED PAD630 12.927725
some college or AA degree PAD630 22.892464
college grad or above PAD630 8.306940
widowed less than 9th grade PAD630 72.509645
no hs diploma PAD630 17.212933
hs grad or GED PAD630 37.211889
some college or AA degree PAD630 11.742558
college grad or above PAD630 63.187041
divorced less than 9th grade PAD630 15.497090
no hs diploma PAD630 47.287308
hs grad or GED PAD630 25.434714
some college or AA degree PAD630 28.490532
college grad or above PAD630 11.322661
separated less than 9th grade PAD630 80.576093
no hs diploma PAD630 46.573934
hs grad or GED PAD630 117.298227
some college or AA degree PAD630 26.011047
----------------------------------------------------------------------------------
The SURVEYMEANS Procedure
Domain Analysis: female*DMDMARTL*DMDEDUC2
female DMDMARTL DMDEDUC2 Variable 95% CL for Mean
-------------------------------------------------------------------------------------------
male married college grad or above PAD630 110.378042 156.155948
widowed less than 9th grade PAD630 54.495098 320.192099
no hs diploma PAD630 137.454469 196.953551
hs grad or GED PAD630 60.904035 145.609819
some college or AA degree PAD630 50.504061 201.374879
college grad or above PAD630 12.206665 235.568810
divorced less than 9th grade PAD630 . .
no hs diploma PAD630 175.421385 300.789710
hs grad or GED PAD630 122.635045 196.605139
some college or AA degree PAD630 151.056984 391.237409
college grad or above PAD630 111.754017 251.012810
separated less than 9th grade PAD630 130.161344 280.105162
no hs diploma PAD630 118.841490 279.328709
hs grad or GED PAD630 178.142847 381.533387
some college or AA degree PAD630 18.877360 143.366332
college grad or above PAD630 58.391158 416.260955
never married less than 9th grade PAD630 93.130856 349.533391
no hs diploma PAD630 173.805502 347.101708
hs grad or GED PAD630 130.681652 208.878407
some college or AA degree PAD630 101.928024 149.954409
college grad or above PAD630 49.385875 159.767320
living with partner less than 9th grade PAD630 38.610579 339.219699
no hs diploma PAD630 165.384495 349.411749
hs grad or GED PAD630 164.450467 265.864972
some college or AA degree PAD630 143.501336 252.861252
college grad or above PAD630 89.554062 121.540892
female married less than 9th grade PAD630 51.465926 194.908597
no hs diploma PAD630 113.812897 172.910335
hs grad or GED PAD630 84.532910 139.083141
some college or AA degree PAD630 97.989913 194.587668
college grad or above PAD630 71.393222 106.445445
widowed less than 9th grade PAD630 23.113955 329.077910
no hs diploma PAD630 54.289234 126.921462
hs grad or GED PAD630 31.970289 188.990735
some college or AA degree PAD630 77.063892 126.613157
college grad or above PAD630 127.683203 394.309208
divorced less than 9th grade PAD630 21.592867 86.984872
no hs diploma PAD630 63.434717 262.969717
hs grad or GED PAD630 91.844821 199.169934
some college or AA degree PAD630 69.478564 189.698101
college grad or above PAD630 52.117900 99.895353
separated less than 9th grade PAD630 -0.640820 339.360573
no hs diploma PAD630 67.585825 264.110649
hs grad or GED PAD630 54.544868 549.500120
some college or AA degree PAD630 87.653676 197.410699
-------------------------------------------------------------------------------------------
The SURVEYMEANS Procedure
Domain Analysis: female*DMDMARTL*DMDEDUC2
female DMDMARTL DMDEDUC2 Variable N Mean
-------------------------------------------------------------------------------------------------
female separated college grad or above PAD630 4 117.921631
never married less than 9th grade PAD630 4 309.991244
no hs diploma PAD630 10 166.333455
hs grad or GED PAD630 19 190.285276
some college or AA degree PAD630 79 109.549232
college grad or above PAD630 37 126.519184
living with partner less than 9th grade PAD630 2 120.000000
no hs diploma PAD630 9 144.995045
hs grad or GED PAD630 17 98.815434
some college or AA degree PAD630 26 152.592251
college grad or above PAD630 13 89.011298
-------------------------------------------------------------------------------------------------
Domain Analysis: female*DMDMARTL*DMDEDUC2
Std Error
female DMDMARTL DMDEDUC2 Variable of Mean
----------------------------------------------------------------------------------
female separated college grad or above PAD630 34.795950
never married less than 9th grade PAD630 93.152302
no hs diploma PAD630 46.458980
hs grad or GED PAD630 55.647515
some college or AA degree PAD630 15.629741
college grad or above PAD630 19.292689
living with partner less than 9th grade PAD630 0
no hs diploma PAD630 24.397178
hs grad or GED PAD630 21.160766
some college or AA degree PAD630 23.746801
college grad or above PAD630 23.175916
----------------------------------------------------------------------------------
Domain Analysis: female*DMDMARTL*DMDEDUC2
female DMDMARTL DMDEDUC2 Variable 95% CL for Mean
-------------------------------------------------------------------------------------------
female separated college grad or above PAD630 44.508594 191.334667
never married less than 9th grade PAD630 113.457066 506.525422
no hs diploma PAD630 68.313575 264.353336
hs grad or GED PAD630 72.879282 307.691269
some college or AA degree PAD630 76.573361 142.525104
college grad or above PAD630 85.815170 167.223199
living with partner less than 9th grade PAD630 120.000000 120.000000
no hs diploma PAD630 93.521499 196.468590
hs grad or GED PAD630 54.170121 143.460748
female living with partner some college or AA degree PAD630 102.490879 202.693622
college grad or above PAD630 40.114391 137.908206
-------------------------------------------------------------------------------------------
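Using proc print, we can see that only two cases have a valid value for pad630 in the subpopulation of females who are living with a partner and have less than a ninth-grade education, and both of those values are 120. Because the two values are identical, there is no variability from which to estimate a standard error, which is why the output above reports a standard error of 0 and confidence limits of 120 and 120.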
proc print data = nhanes2012; var pad630; where female = 1 and dmdmartl = 6 and dmdeduc2 = 1; run;
Obs PAD630
344 120
479 .
1339 .
1962 .
1987 120
2075 .
2148 .
2178 .
2631 .
2972 .
3148 .
4118 .
4595 .
6610 .
7064 .
7112 .
7214 .
7709 .
7829 .
8095 .
8479 .
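A quicker check than scanning the listing is to count the valid and missing values directly. The following is a minimal sketch that uses proc means with the same where statement as the proc print above. It assumes the nhanes2012 data set used throughout this seminar, and it is an unweighted count, not a survey estimate.

```sas
* count valid and missing values of pad630 in the subpopulation;
proc means data = nhanes2012 n nmiss min max;
  where female = 1 and dmdmartl = 6 and dmdeduc2 = 1;
  var pad630;
run;
```

Given the listing above, this should report two valid values, 19 missing values, and a minimum and maximum of 120.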
Now let’s say that you want to compare the means from two domains. In this example, we get the mean of pad630 for females and males.
proc surveymeans data = nhanes2012; weight wtint2yr; cluster sdmvpsu; strata sdmvstra; domain female; var pad630; format female fm.; run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 14
Number of Clusters 31
Number of Observations 9756
Sum of Weights 306590681
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
PAD630 2054 139.887377 5.579060 128.116590 151.658164
---------------------------------------------------------------------------------
Domain Analysis: female
Std Error
female Variable N Mean of Mean 95% CL for Mean
-------------------------------------------------------------------------------------------
male PAD630 1136 155.627196 7.008380 140.840807 170.413584
female PAD630 918 121.684284 5.352345 110.391824 132.976744
-------------------------------------------------------------------------------------------
There are a few different ways that you could compare 155.63 and 121.68. In this example, we will use proc surveyreg and the contrast statement. Notice that the output in the section titled Estimated Regression Coefficients is almost identical to the output of the proc surveymeans above.
proc surveyreg data = nhanes2012; weight wtint2yr; cluster sdmvpsu; strata sdmvstra; class female; model pad630 = female / noint solution vadjust = none; contrast 'comparing males and females' female 1 -1; format female fm.; run;
The SURVEYREG Procedure
Regression Analysis for Dependent Variable PAD630
Data Summary
Number of Observations 2054
Sum of Weights 88768571
Weighted Mean of PAD630 139.88738
Weighted Sum of PAD630 1.24176E10
Design Summary
Number of Strata 14
Number of Clusters 31
Fit Statistics
R-square 0.5545
Root MSE 126.37
Denominator DF 17
Class Level Information
CLASS
Variable Levels Values
female 2 female male
Tests of Model Effects
Effect Num DF F Value Pr > F
Model 2 325.56 <.0001
female 2 325.56 <.0001
NOTE: The denominator degrees of freedom for the F tests is 17.
Estimated Regression Coefficients
Standard
Parameter Estimate Error t Value Pr > |t|
female female 121.684284 5.35234462 22.73 <.0001
female male 155.627196 7.00837974 22.21 <.0001
NOTE: The denominator degrees of freedom for the t tests is 17.
Regression Analysis for Dependent Variable PAD630
Analysis of Contrasts
Contrast Num DF F Value Pr > F
comparing males and females 1 31.67 <.0001
NOTE: The denominator degrees of freedom for the F tests is 17.
In this example, we do the same analysis using proc surveyreg and the lsmeans statement. This example is adapted from code on the SAS website.
proc surveyreg data = nhanes2012; weight wtint2yr; cluster sdmvpsu; strata sdmvstra; class female; model pad630 = female / noint solution vadjust = none; lsmeans female / diff; format female fm.; run;
The SURVEYREG Procedure
Regression Analysis for Dependent Variable PAD630
Data Summary
Number of Observations 2054
Sum of Weights 88768571
Weighted Mean of PAD630 139.88738
Weighted Sum of PAD630 1.24176E10
Design Summary
Number of Strata 14
Number of Clusters 31
Fit Statistics
R-square 0.5545
Root MSE 126.37
Denominator DF 17
Class Level Information
CLASS
Variable Levels Values
female 2 female male
Tests of Model Effects
Effect Num DF F Value Pr > F
Model 2 325.56 <.0001
female 2 325.56 <.0001
NOTE: The denominator degrees of freedom for the F tests is 17.
Estimated Regression Coefficients
Standard
Parameter Estimate Error t Value Pr > |t|
female female 121.684284 5.35234462 22.73 <.0001
female male 155.627196 7.00837974 22.21 <.0001
NOTE: The denominator degrees of freedom for the t tests is 17.
Regression Analysis for Dependent Variable PAD630
female Least Squares Means
Standard
female Estimate Error DF t Value Pr > |t|
female 121.68 5.3523 17 22.73 <.0001
male 155.63 7.0084 17 22.21 <.0001
Differences of female Least Squares Means
Standard
female _female Estimate Error DF t Value Pr > |t|
female male -33.9429 6.0314 17 -5.63 <.0001
If you square the t-value from this analysis, you will get the F-value given in the previous proc surveyreg analysis. The estimate of -33.94 is simply the difference of the means, 121.68 – 155.63 (with a little rounding error).
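The arithmetic can be checked in a short data step. This is only a sketch that re-uses the numbers printed in the output above; the variable names diff, t and f are made up for the illustration.

```sas
* check the arithmetic using the values printed in the output above;
data _null_;
  diff = 121.684284 - 155.627196;  /* female mean minus male mean */
  t = -33.9429 / 6.0314;           /* estimate divided by its standard error */
  f = t**2;                        /* squared t statistic */
  put diff= t= f=;
run;
```

The log should show a difference of about -33.94, a t-value of about -5.63 and an F-value of about 31.67, which agrees with the contrast statement in the earlier proc surveyreg analysis.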
Regression
Now we will look at a few examples of regression analyses. We will use proc surveyreg and proc surveylogistic. The variables in these examples were chosen because they were either continuous or categorical, not because of the data that they contain. In other words, the models shown here were not constructed to make substantive sense; rather, they were constructed to illustrate how certain things can be done. The variable pad630 is the number of minutes spent doing moderate-intensity activities on a typical work day.
proc surveyreg data = nhanes2012; weight wtint2yr; cluster sdmvpsu; strata sdmvstra; class female; model pad630 = female ridageyr / solution; format female fm.; run;
The SURVEYREG Procedure
Regression Analysis for Dependent Variable PAD630
Data Summary
Number of Observations 2054
Sum of Weights 88768571
Weighted Mean of PAD630 139.88738
Weighted Sum of PAD630 1.24176E10
Design Summary
Number of Strata 14
Number of Clusters 31
Fit Statistics
R-Square 0.01933
Root MSE 126.29
Denominator DF 17
Class Level Information
CLASS
Variable Levels Values
female 2 female male
Tests of Model Effects
Effect Num DF F Value Pr > F
Model 2 18.23 <.0001
Intercept 1 327.92 <.0001
female 1 30.26 <.0001
RIDAGEYR 1 4.97 0.0396
NOTE: The denominator degrees of freedom for the F tests is 17.
Estimated Regression Coefficients
Standard
Parameter Estimate Error t Value Pr > |t|
Intercept 167.289537 9.31899525 17.95 <.0001
female female -33.125134 6.02167950 -5.50 <.0001
female male 0.000000 0.00000000 . .
RIDAGEYR -0.287604 0.12906252 -2.23 0.0396
NOTE: The degrees of freedom for the t tests is 17.
Matrix X'WX is singular and a generalized inverse was used to solve the normal equations.
Estimates are not unique.
There is a class statement in proc surveyreg (there isn’t one in proc reg), and, depending on the version of SAS/Stat that you are running, it has many of the options that are found on the class statements in most other SAS procedures. The default in SAS is to use the last category in sort order as the reference group. For an unformatted numeric variable, this is the highest-numbered category, so the reference group for the variable female would be 1 (females), not 0 (males). When a format is applied, as in the previous example, the categories are sorted by their value labels, which is why male (the label that comes last alphabetically) is the reference group in the output above. In the example below, the category coded 0 is used as the reference group for both predictor variables, female and hsq571 (have you donated blood in the last year?). Besides using the options on the class statement, you can change the reference group by using a format such that the group that you want to be the reference group has the value label that comes last alphabetically. Notice on the model statement that the “|” symbol was used. This tells SAS to include both the main effects and the interaction in the model.
proc surveyreg data = nhanes2012; weight wtint2yr; cluster sdmvpsu; strata sdmvstra; class female (ref = first) hsq571 (ref = '0'); * class female hsq571 / ref = first; model pad630 = female|hsq571 ridageyr / solution; run;
The SURVEYREG Procedure
Regression Analysis for Dependent Variable PAD630
Data Summary
Number of Observations 1673
Sum of Weights 76183526
Weighted Mean of PAD630 145.83021
Weighted Sum of PAD630 1.11099E10
Design Summary
Number of Strata 14
Number of Clusters 31
Fit Statistics
R-Square 0.04574
Root MSE 126.20
Denominator DF 17
Class Level Information
CLASS
Variable Levels Values
female 2 1 0
HSQ571 2 1 0
Tests of Model Effects
Effect Num DF F Value Pr > F
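Model 4 27.13 <.0001
...
Estimated Regression Coefficients
Standard
Parameter Estimate Error t Value Pr > |t|
Intercept 209.155683 11.9883632 17.45 <.0001
...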
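The format-based way of changing the reference group can be sketched as follows. In the first regression above, where the fm. format was applied, the level labelled male (which sorts after female) received the zero coefficient. Under that default ordering, a format whose label for females sorts last should make females the reference group. The format name fmalt and the labels men and women are hypothetical and are used only for this illustration.

```sas
* hypothetical format: the label that sorts last becomes the reference group;
proc format;
  value fmalt 0 = 'men'
              1 = 'women';
run;

proc surveyreg data = nhanes2012;
  weight wtint2yr;
  cluster sdmvpsu;
  strata sdmvstra;
  class female;
  model pad630 = female ridageyr / solution;
  format female fmalt.;
run;
```

With this format, the row for women should show the zero coefficient, and the coefficient for men should have the opposite sign of the coefficient for female in the first regression.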
Below are a few examples of binary logistic regression. The variable paq665 asks if you do any moderate-intensity sports. A little data management is needed before we can run the logistic regression. Notice the options on the class statement and the model statement.
* logistic regression; data nhanes2012b; set nhanes2012; age1 = 0; if ridageyr > 20 then age1 = 1; paq665 = paq665-1; run;
* how you specify the reference group depends on whether or not you have a format; * notice where the desc option goes; proc surveylogistic data = nhanes2012b; weight wtint2yr; cluster sdmvpsu; strata sdmvstra; class hsd010 (reference = '3') female (reference = 'male') / param = ref; model paq665 (desc) = hsd010|female ridageyr; format female fm.; run;
The SURVEYLOGISTIC Procedure
Model Information
Data Set WORK.NHANES2012B
Response Variable PAQ665
Number of Response Levels 2
Stratum Variable SDMVSTRA
Number of Strata 14
Cluster Variable SDMVPSU
Number of Clusters 31
Weight Variable WTINT2YR
Model Binary Logit
Optimization Technique Fisher's Scoring
Variance Adjustment Degrees of Freedom (DF)
Variance Estimation
Method Taylor Series
Variance Adjustment Degrees of Freedom (DF)
Number of Observations Read 9756
Number of Observations Used 5890
Sum of Weights Read 3.0659E8
Sum of Weights Used 2.2591E8
Response Profile
Ordered Total Total
Value PAQ665 Frequency Weight
1 1 3347 117045323
2 0 2543 108863614
Probability modeled is PAQ665=1.
NOTE: 3866 observations were deleted due to missing values for the response or explanatory
variables.
Class Level Information
Class Value Design Variables
HSD010 1 1 0 0 0
2 0 1 0 0
3 0 0 0 0
4 0 0 1 0
5 0 0 0 1
Class Level Information
Class Value Design Variables
female female 1
male 0
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Intercept
Intercept and
Criterion Only Covariates
AIC 312879907 301730876
SC 312879914 301730950
-2 Log L 312879905 301730854
Testing Global Null Hypothesis: BETA=0
Test F Value Num DF Den DF Pr > F
Likelihood Ratio 1114905 10 Infty <.0001
Score 66.25 10 17 <.0001
Wald 54.07 10 17 <.0001
Joint Tests
Wald
Effect DF Chi-Square Pr > ChiSq
HSD010 4 107.3724 <.0001
female 1 0.2340 0.6286
HSD010*female 4 7.8696 0.0965
RIDAGEYR 1 19.3881 <.0001
NOTE: Under full-rank parameterizations, Type 3 effect tests are replaced by joint tests. The
joint test for an effect is a test that all the parameters associated with that effect are
zero. Such joint tests might not be equivalent to Type 3 effect tests under GLM
parameterization.
Analysis of Maximum Likelihood Estimates
Standard
Parameter Estimate Error t Value Pr > |t|
Intercept -0.2185 0.1105 -1.98 0.0645
HSD010 1 -0.3381 0.1390 -2.43 0.0263
HSD010 2 -0.4097 0.1646 -2.49 0.0234
HSD010 4 0.5232 0.1553 3.37 0.0036
HSD010 5 1.4377 0.4666 3.08 0.0068
female female 0.0682 0.1410 0.48 0.6348
HSD010*female 1 female -0.4494 0.1744 -2.58 0.0196
HSD010*female 2 female -0.1535 0.2124 -0.72 0.4796
HSD010*female 4 female -0.1603 0.2996 -0.54 0.5996
HSD010*female 5 female -0.1676 0.5184 -0.32 0.7503
RIDAGEYR 0.00963 0.00219 4.40 0.0004
NOTE: The degrees of freedom for the t tests is 17.
Odds Ratio Estimates
Point 95% Confidence
Effect Estimate Limits
RIDAGEYR 1.010 1.005 1.014
NOTE: The degrees of freedom in computing
the confidence limits is 17.
Association of Predicted Probabilities and Observed Responses
Percent Concordant 61.3 Somers' D 0.232
Percent Discordant 38.1 Gamma 0.234
Percent Tied 0.6 Tau-a 0.114
Pairs 8511421 c 0.616
We can use the expb option on the model statement to get the odds ratios. We can use the clodds option to get the confidence limits around the odds ratios. SAS will provide a generalized R-square, but not all statisticians agree that this is appropriate. The variable hsq470 is the number of days in the last 30 days that physical health was not good, and hsq480 is the number of days in the past 30 days that mental health was not good.
proc surveylogistic data = nhanes2012b; weight wtint2yr; cluster sdmvpsu; strata sdmvstra; model paq665 (desc) = ridageyr hsq470 hsq480 / expb clodds rsquare; run;
The SURVEYLOGISTIC Procedure
Model Information
Data Set WORK.NHANES2012B
Response Variable PAQ665
Number of Response Levels 2
Stratum Variable SDMVSTRA
Number of Strata 14
Cluster Variable SDMVPSU
Number of Clusters 31
Weight Variable WTINT2YR
Model Binary Logit
Optimization Technique Fisher's Scoring
Variance Adjustment Degrees of Freedom (DF)
Variance Estimation
Method Taylor Series
Variance Adjustment Degrees of Freedom (DF)
Number of Observations Read 9756
Number of Observations Used 5873
Sum of Weights Read 3.0659E8
Sum of Weights Used 2.2552E8
Response Profile
Ordered Total Total
Value PAQ665 Frequency Weight
1 1 3334 116723574
2 0 2539 108793841
Probability modeled is PAQ665=1.
NOTE: 3883 observations were deleted due to missing values for the response or explanatory
variables.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Intercept
Intercept and
Criterion Only Covariates
AIC 312354637 306355580
SC 312354644 306355607
-2 Log L 312354635 306355572
R-Square 1.0000 Max-rescaled R-Square 1.0000
Testing Global Null Hypothesis: BETA=0
Test F Value Num DF Den DF Pr > F
Likelihood Ratio 1999688 3 Infty <.0001
Score 12.07 3 17 0.0002
Wald 13.15 3 17 0.0001
Analysis of Maximum Likelihood Estimates
Standard
Parameter Estimate Error t Value Pr > |t| Exp(Est)
Intercept -0.5190 0.1022 -5.08 <.0001 0.595
RIDAGEYR 0.0106 0.00222 4.77 0.0002 1.011
HSQ470 0.0296 0.00755 3.92 0.0011 1.030
HSQ480 0.0124 0.00289 4.30 0.0005 1.013
NOTE: The degrees of freedom for the t tests is 17.
Association of Predicted Probabilities and Observed Responses
Percent Concordant 57.5 Somers' D 0.159
Percent Discordant 41.6 Gamma 0.161
Percent Tied 0.9 Tau-a 0.078
Pairs 8465026 c 0.580
Odds Ratio Estimates and t Confidence Intervals
Effect Unit Estimate 95% Confidence Limits
RIDAGEYR 1.0000 1.011 1.006 1.015
HSQ470 1.0000 1.030 1.014 1.047
HSQ480 1.0000 1.013 1.006 1.019
NOTE: The degrees of freedom in computing
the confidence limits is 17.
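To see where the odds ratio and its confidence limits come from, the calculation for ridageyr can be reproduced by hand. The sketch below exponentiates the estimate and the t-based limits, using the estimate, standard error and degrees of freedom printed in the output above. The variable names are made up for the illustration, and because the inputs are rounded, the results may differ slightly from the SAS output in the last decimal place.

```sas
* reproduce the odds ratio and t-based confidence limits for ridageyr;
data _null_;
  est = 0.0106;               /* estimate from the output above */
  se = 0.00222;               /* standard error from the output above */
  df = 17;                    /* denominator degrees of freedom */
  tcrit = tinv(0.975, df);    /* two-sided 95% critical value */
  odds_ratio = exp(est);
  lower = exp(est - tcrit * se);
  upper = exp(est + tcrit * se);
  put odds_ratio= 8.3 lower= 8.3 upper= 8.3;
run;
```

The values written to the log should be close to 1.011, 1.006 and 1.015, matching the row for RIDAGEYR in the odds ratio table.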
Using proc glimmix
The following example is copied directly from the SAS website because I don’t have any good data to use for an example. Please see this page for this example and more information. Besides the comments that SAS makes below, there are a few things that I would like to point out. First, notice that you MUST supply two weight variables: a weight for level 1 and a weight for level 2. This is not an inconvenience of using SAS; rather, this is true of running any type of multilevel model in any statistical package. You need to do this because the level 1 sampling weights and the level 2 sampling weights enter into the multilevel model equation in different places. Having the two sampling weights is often a problem with public-use survey data sets, because the data are often not released with level 1 and level 2 weights. The other issue that you need to know about is the scaling of the level 1 sampling weights. This is not an issue in single-level models, but it is an issue in multilevel models. At this time, SAS does not have an option to scale the weights for you; rather, you need to do it yourself in a data step before you run proc glimmix. Please see Pfeffermann et al. (1998) and Rabe-Hesketh and Skrondal (2006) for more information.
proc glimmix data=dws method=quadrature empirical=classical; class id; model y = x1 x2 / dist=binomial link=probit obsweight=sw1 solution; random int / subject=id weight=w2; run;
To fit a weighted multilevel model, you should use METHOD=QUAD. The EMPIRICAL=CLASSICAL option in the PROC GLIMMIX statement instructs PROC GLIMMIX to compute the empirical (sandwich) variance estimators for the fixed effect and the variance. The empirical variance estimators are recommended for the inference about fixed effects and variance estimated by pseudo-likelihood.
Carle (2009) provides the SAS and Stata code for the two most common methods of scaling the level 1 weights in Appendix B of his paper Fitting Multilevel Models in Complex Survey Data with Design Weights: Recommendations. One method scales the level 1 weight to the sample size within each cluster; the other method scales the level 1 weight to the effective sample size. There is currently no recommendation about when to use either type of scaling; rather, the recommendation is to do a sensitivity analysis comparing both methods.
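The two scalings can be sketched in a few lines of proc sql. This is not Carle's code; it is a minimal illustration of the two formulas as they are usually written, using the data set and variable names from the proc glimmix example above (dws, id and sw1). The output data set dws_scaled and the new variables sw1_a and sw1_b are hypothetical names. The first scaling multiplies each level 1 weight by the cluster sample size divided by the sum of the weights in the cluster, so the scaled weights sum to the cluster sample size. The second multiplies each weight by the sum of the weights divided by the sum of the squared weights, so the scaled weights sum to the effective sample size.

```sas
* sketch: scale the level 1 weights within each cluster (id);
proc sql;
  create table dws_scaled as
  select *,
         sw1 * count(sw1) / sum(sw1)     as sw1_a, /* sums to the cluster sample size */
         sw1 * sum(sw1) / sum(sw1 * sw1) as sw1_b  /* sums to the effective sample size */
  from dws
  group by id;
quit;
```

For the sensitivity analysis, the proc glimmix model would then be run twice on dws_scaled, once with obsweight=sw1_a and once with obsweight=sw1_b, and the results compared.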
A few words about proc surveyimpute
The following is quoted from the SAS documentation: http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_surveyimpute_overview.htm
The SURVEYIMPUTE procedure imputes missing values of an item in a data set by replacing them with observed values from the same item. The principles by which the imputation is performed are particularly useful for survey data. PROC SURVEYIMPUTE also computes replicate weights (such as jackknife weights) that account for the imputation and that can be used for replication-based variance estimation for complex surveys. The procedure implements a fractional hot-deck imputation technique (Kim and Fuller 2004; Fuller 2009; Kim and Shao 2014) in addition to some traditional hot-deck imputation techniques (Andridge and Little 2010).
Nonresponse is a common problem in almost all surveys of human populations. Estimators that are based on survey data that include nonresponse can suffer from nonresponse bias if the nonrespondents are different from the respondents. Estimators that use complete cases (only the observed units) might also be less precise. Imputation techniques are important tools for reducing nonresponse bias and producing efficient estimators.
The main objectives of any imputation technique are to eliminate the nonresponse bias and to provide an imputed data set that results in consistent analyses conducted with the imputed data. In addition, a variance estimator must be available that accounts for both the sampling variance and the imputation variance. Imputation techniques use implicit or explicit models. Some model-based imputation techniques include multiple imputation, mean imputation, and regression imputation. For more information about multiple imputation in SAS/STAT, see Chapter 75: The MI Procedure, and Chapter 76: The MIANALYZE Procedure.
Imputation techniques that do not use explicit models include hot-deck imputation, cold-deck imputation, and fractional imputation. PROC SURVEYIMPUTE implements imputation techniques that do not use explicit models. It also produces replicate weights that can be used with any survey analysis procedure in SAS/STAT to estimate both the sampling variability and the imputation variability.
Hot-deck imputation is the most commonly used imputation technique for survey data. A donor is selected for a recipient unit, and the observed values of the donor are imputed for the missing items of the recipient. Although the imputation method is straightforward, the variance estimator that accounts for imputation variance might not be simple and is often ignored in practice. PROC SURVEYIMPUTE does not create imputation-adjusted replicate weights for hot-deck imputation.
Fractional hot-deck imputation (Kalton and Kish 1984; Fay 1996; Kim and Fuller 2004; Fuller and Kim 2005), also known as fractional imputation (FI), is a variation of hot-deck imputation in which one missing item for a recipient is imputed from multiple donors. Each donor donates a fraction of the original weight of the recipient such that the sum of the fractional weights from all the donors is equal to the original weight of the recipient. For fully efficient fractional imputation (FEFI), all observed values in an imputation cell are used as donors for a recipient unit in that cell (Kim and Fuller 2004).
The SURVEYIMPUTE procedure implements single and multiple hot-deck imputation and FEFI. Available donor selection techniques include simple random selection with or without replacement, probability proportional to weights selection (Rao and Shao 1992), and approximate Bayesian bootstrap selection (Rubin and Schenker 1986).
End quote.
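As a rough illustration of the syntax, the sketch below requests fully efficient fractional imputation with jackknife replicate weights for two categorical items from the data set used in the logistic regression examples. It follows the pattern of the examples in the SAS documentation as I understand it. The choice of variables (hsd010 and paq665) and the output data set names nhanes_imputed and nhanes_jkcoefs are assumptions made only for the illustration, and, like the models above, it is not meant to make substantive sense.

```sas
* sketch only: fully efficient fractional imputation for two categorical items;
proc surveyimpute data = nhanes2012b method = fefi varmethod = jackknife;
  weight wtint2yr;
  cluster sdmvpsu;
  strata sdmvstra;
  class hsd010 paq665;
  var hsd010 paq665;
  output out = nhanes_imputed outjkcoefs = nhanes_jkcoefs;
run;
```

The output data set should contain the imputation-adjusted weight and the replicate weights, which can then be supplied to the other survey procedures. Please check the proc surveyimpute documentation for the options available in your version of SAS/Stat before relying on this code.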
A great deal of work has been done with respect to imputation methods for complex survey data. While this topic is beyond the scope of this workshop, interested readers may want to see
Andridge, Rebecca R. and Little, Roderick J. (2009). The Use of Sample Weights in Hot Deck Imputation. Journal of Official Statistics; 25(1): 21-36.
and
Bell, Bethany A., Kromrey, Jeffrey D., and Ferron, John M. (2009). Section on Survey Research Methods, JSM 2009.
For more information on using the NHANES data sets
There are helpful resources for learning how to analyze the NHANES data sets correctly. One is a listserv at http://www.cdc.gov/nchs/nhanes/nhanes_listserv.htm . There are also online tutorials at http://www.cdc.gov/nchs/tutorials/index.htm .
References
Applied Survey Data Analysis by Steven G. Heeringa, Brady T. West, and Patricia A. Berglund
Analysis of Health Surveys by Edward L. Korn and Barry I. Graubard
Sampling of Populations: Methods and Applications, Fourth Edition by Paul Levy and Stanley Lemeshow
Analysis of Survey Data Edited by R. L. Chambers and C. J. Skinner
Sampling Techniques, Third Edition by William G. Cochran
Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., and Rasbash, J. (1998), Weighting for Unequal Selection Probabilities in Multilevel Models, Journal of the Royal Statistical Society, Series B, 60, 23-40.
Rabe-Hesketh, S. and Skrondal, A. (2006), Multilevel Modelling of Complex Survey Data, Journal of the Royal Statistical Society, Series A, 169, 805-827.
Carle, Adam C. (2009). Fitting Multilevel Models in Complex Survey Data with Design Weights: Recommendations. BMC Medical Research Methodology; 9(49).
Quartagno, M., Carpenter, R., and Goldstein, H. (2019). Multiple Imputation with Survey Weights: A Multilevel Approach. Journal of Survey Statistics and Methodology, Volume 8, Issue 5, November 2020, Pages 965–989, https://doi.org/10.1093/jssam/smz036


* getting just the histogram;
ods graphics on;
proc surveymeans data = nhanes2012 plots = histogram;
weight wtint2yr;
cluster sdmvpsu;
strata sdmvstra;
var ridageyr;
run;
ods graphics off;




