We will be using SAS-callable SUDAAN for this seminar. This means that SUDAAN runs from within your SAS session. You do not need to "turn on" SUDAAN or do anything like that. Once SAS is running, you access SUDAAN by running a SUDAAN command. Your SAS and SUDAAN code will be in the same program, and there is no need to differentiate the code in any way. If you have a copy of stand-alone SUDAAN, most of what is presented here will work for you, except, of course, the SAS code. Also, you will not need the run statements at the end of the proc steps.
As you may have guessed, SUDAAN code looks much like SAS code. There are eight analysis procs in SUDAAN. All of your data management must be done in another package. Because we are using SAS-callable SUDAAN, we will assume that you will do your data management in SAS, and we assume that you are familiar with the basics of doing data management in SAS. If you are not, or if you would like some additional information, please see our page on data management in SAS.
Perhaps the most important data management issue that you will encounter is that SUDAAN considers values of 0 to be missing in all procs; the one exception is the dependent variable in proc rlogist (used for logistic regression). This means that if you have a variable called female that is coded 1 for females and 0 for males, all of the males in the data set will be considered missing. You can either recode such variables in SAS or you can use the recode statement in SUDAAN. We have some examples of how to use the recode statement in our FAQ "How can I use the recode statement in SUDAAN?"
Another important data management issue is how missing values are coded in your data set. If you are using a public-use data file, you can look at the codebook for this information. Frequently, values such as 999, 888, -999 and -888 are used to indicate missing values. You will need to recode these before using them in SUDAAN, as there is no way to tell SUDAAN to consider such values missing. We have included some SAS code at the end of this seminar that can be used as a template for recoding missing values in all numerical variables in a data set.
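As a minimal sketch of the SAS route for the 0-coded problem described above, the data step below copies a 0/1 indicator into a 1/2 variable. The data set names survey and survey_recoded and the variable names female and female12 are hypothetical and are used only for illustration; the SUDAAN recode statement is an alternative and is not shown here.

```sas
/* Hypothetical example: female is coded 1 = female, 0 = male. */
data survey_recoded;
  set survey;
  /* Create a 1/2 version so that no valid category is coded 0. */
  if female = 1 then female12 = 1;       /* females stay 1 */
  else if female = 0 then female12 = 2;  /* males move from 0 to 2 */
  /* Any other value, including missing, leaves female12 missing. */
run;

/* Check the recode by cross-tabulating the old and new codes. */
proc freq data = survey_recoded;
  tables female * female12 / missing;
run;
```

Keeping the original variable and writing the new codes to a second variable makes it easy to confirm in the proc freq table that every 0 became a 2.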
One other thing that we should point out before we start running some code is the way SUDAAN code looks in the SAS Enhanced Program Editor. The editor colors key words according to SAS syntax and does not recognize SUDAAN key words. Hence, you will see some words in blue and others in red. The coloring is no indication of the correctness of your SUDAAN program. Just ignore it.
For the following examples, we will use the CHIS data set. CHIS is a publicly available data set that uses replicate weights, instead of PSUs and strata, to correct the standard errors of the estimates. For examples of how to use PSU and strata variables in SUDAAN, please see our page based on textbook examples. We have run the SAS code to create a temporary SAS data set, which we have called chis. We will start with proc descript, which is used to get basic descriptive statistics.
proc descript data=chis filetype=sas design = jackknife;
  weight rakedw0;
  jackwgts rakedw1--rakedw80 / adjjack=1;
  var ad5;
run;
Let’s look at the code above before we look at the output. The proc descript command is a SUDAAN command. The data = option is just like the SAS data = option on a proc statement. Note that, unlike in SAS, you cannot use a pathname here. You must use either a temporary data set, as we do here, or a libname. The filetype = option is needed to tell SUDAAN what type of data file you are using. Note that SUDAAN can also read SAS export files (sasxport), ASCII files and SPSS files. The design = option is also necessary. You will get this information from the codebook. In this case, we specify the design as jackknife because we have jackknife replicate weights. Jackknife is one of several ways to create replicate weights, so even though you see replicate weights in the data file, you need to know how they were created so that you can specify them correctly in your SUDAAN program. Again, this is information that you will get from the documentation. There is no way of knowing this information just from looking at the data set.
On the weight statement we indicate the pweight, sometimes called the final pweight. On the jackwgts statement we indicate which variables are the replicate weights. The double dash is used to indicate positionally consecutive variables in the data set. The adjjack = option after the slash is necessary. While this is often set to 1, you will need to check the documentation for the correct value. Although these three lines of code seem to be complex, once you have them correct, they do not need to be modified again. Personally, I just cut-and-paste them from one proc to the next. Because they specify the sampling design used in the collection of the survey data, this is information that will not change during the course of data analysis. All of this information is specified in exactly the same way for every proc in SUDAAN. The var statement is the same as the var statement in SAS. Then you give a run statement and you are finished!
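Because the double dash selects variables by their position in the data set, it may be worth confirming that the replicate weights are stored next to each other before relying on rakedw1--rakedw80. One way to check, sketched here with standard SAS, is to list the variables in their stored order:

```sas
/* List the variables of chis in their stored (positional) order. */
proc contents data = chis varnum;
run;
```

If rakedw1 through rakedw80 appear as one unbroken block in that listing, the double dash picks up exactly those 80 variables.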
Finally, let’s look at the output.
S U D A A N
Software for the Statistical Analysis of Correlated Data
Copyright Research Triangle Institute January 2003
Release 8.0.2

Number of observations read    :  55428    Weighted count : 23847415
Denominator degrees of freedom :     80

Date: 02-26-2004    Research Triangle Institute    Page  : 1
Time: 13:20:25      The DESCRIPT Procedure         Table : 1

Variance Estimation Method: Replicate Weight Jackknife
by: Variable, One.

-----------------------------------------------
| Variable     |               | One          |
|              |               | 1            |
-----------------------------------------------
| How many Pap | Sample Size   |        30530 |
| smear tests  | Weighted Size |  11141052.70 |
| last 6 years | Total         |  51425352.66 |
|              | Mean          |         4.62 |
|              | SE Mean       |         0.02 |
-----------------------------------------------
We see at the top that 55,428 observations were read from the data set and that, when weighted with the pweight, the count is 23,847,415. The first line of the table shows the sample size. This is the number of cases used in the analysis. The second line indicates the number of individuals in the population that the sample represents. Note that these numbers differ from those at the top of the output because of the variable we are looking at (men don’t get Pap smears). The third line gives the total, the fourth line the mean and the fifth line the standard error of the mean of the variable specified on the var statement. You can use options on the proc descript statement to add other statistics to this output, and you can also add an output or print statement.
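The rows of the table are related: the mean is the weighted total divided by the weighted size. As a quick check of that relationship, the following sketch uses values copied from the output above (the variable names are only illustrative):

```sas
/* Values copied from the proc descript output above. */
data _null_;
  total         = 51425352.66;  /* Total row         */
  weighted_size = 11141052.70;  /* Weighted Size row */
  mean_check    = total / weighted_size;
  put mean_check= 8.2;          /* writes the ratio to the SAS log */
run;
```

The ratio is about 4.62, which matches the Mean row.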
Now let’s try an example in which the totals and means are found for different groups, such as race. We add the categorical variable on the tables statement. In SUDAAN version 8, you will also need to use a subgroup statement, on which you list the variables just as they appear on the tables statement, and a levels statement, on which you specify the number of levels of the variable(s) on the subgroup statement. We will look at seven categories of race. If we only wanted to look at the first three categories, we could type 3 on the levels statement. If you had two variables listed on the subgroup statement, you would list the number of categories for each variable on the levels statement in the order that the variables are listed on the subgroup statement. Like the variable names, the numbers of levels are separated by spaces.
proc descript data=chis filetype=sas design = jackknife;
  weight rakedw0;
  jackwgts rakedw1--rakedw80 / adjjack=1;
  var ad5;
  tables racehpra;
  subgroup racehpra;
  levels 7;
run;
Variance Estimation Method: Replicate Weight Jackknife
by: Variable, Race - UCLA CHPR Definition.

-----------------------------------------------------------------------------
| Variable     |               | Race - UCLA CHPR Definition                |
|              |               | Total        | LATINO       | PACIFIC      |
|              |               |              |              | ISLANDER     |
-----------------------------------------------------------------------------
| How many Pap | Sample Size   |        30530 |         5027 |          106 |
| smear tests  | Weighted Size |  11141052.70 |   2504107.61 |     24165.49 |
| last 6 years | Total         |  51425352.66 |  11440176.40 |    111606.38 |
|              | Mean          |         4.62 |         4.57 |         4.62 |
|              | SE Mean       |         0.02 |         0.05 |         0.43 |
-----------------------------------------------------------------------------

-----------------------------------------------------------------------------
| Variable     |               | Race - UCLA CHPR Definition                |
|              |               | AIAN         | ASIAN        | AFRICAN      |
|              |               |              |              | AMERICAN     |
-----------------------------------------------------------------------------
| How many Pap | Sample Size   |          418 |         1814 |         1657 |
| smear tests  | Weighted Size |     39588.04 |   1002420.57 |    727651.50 |
| last 6 years | Total         |    177162.48 |   4112383.47 |   3629231.58 |
|              | Mean          |         4.48 |         4.10 |         4.99 |
|              | SE Mean       |         0.20 |         0.11 |         0.11 |
-----------------------------------------------------------------------------

--------------------------------------------------------------
| Variable     |               | Race - UCLA CHPR Definition |
|              |               | WHITE        | OTH          |
|              |               |              | SINGL/MULTI  |
|              |               |              | RACE         |
--------------------------------------------------------------
| How many Pap | Sample Size   |        20692 |          816 |
| smear tests  | Weighted Size |   6501687.43 |    341432.05 |
| last 6 years | Total         |  30299302.56 |   1655489.78 |
|              | Mean          |         4.66 |         4.85 |
|              | SE Mean       |         0.03 |         0.12 |
--------------------------------------------------------------
In the third column in the table on the top, we see the results for the total number of cases involved in the analysis. In the following columns we see the results broken out for each of the races. Notice the differences in the sample sizes.
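The number given on the levels statement depends on how the categorical variable is coded, so it can help to look at its underlying codes first. A minimal sketch, using the same proc freq technique that this seminar applies to srsex later on:

```sas
/* Show the numeric codes of racehpra rather than its value labels. */
proc freq data = chis;
  tables racehpra;
  format racehpra;
run;
```

The number of distinct codes listed is the largest value that makes sense on the levels statement for this variable.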
Now let’s try using proc crosstab. We will make a crosstab of gender and race. We are only going to use two levels of race, just to make the output shorter.
proc crosstab data=chis filetype=sas design = jackknife;
  weight rakedw0;
  jackwgts rakedw1--rakedw80 / adjjack=1;
  tables srsex*racehpra;
  subgroup srsex racehpra;
  levels 2 2;
run;
Variance Estimation Method: Replicate Weight Jackknife
by: SRSEX, RACEHPRA.

------------------------------------------------------------------------------
| SRSEX        |                | RACEHPRA                                   |
|              |                | Total        | LATINO       | PACIFIC      |
|              |                |              |              | ISLANDER     |
------------------------------------------------------------------------------
| Total        | Sample Size    |         9677 |         9458 |          219 |
|              | Weighted Size  |   5705917.88 |   5643945.79 |     61972.10 |
|              | SE Weighted    |     28246.94 |     28469.00 |      4755.06 |
|              | Row Percent    |       100.00 |        98.91 |         1.09 |
|              | Col Percent    |       100.00 |       100.00 |       100.00 |
|              | Tot Percent    |       100.00 |        98.91 |         1.09 |
|              | SE Row Percent |         0.00 |         0.08 |         0.08 |
|              | SE Col Percent |         0.00 |         0.00 |         0.00 |
|              | SE Tot Percent |         0.00 |         0.08 |         0.08 |
------------------------------------------------------------------------------
| MALE         | Sample Size    |         4084 |         3983 |          101 |
|              | Weighted Size  |   2866894.01 |   2836612.17 |     30281.84 |
|              | SE Weighted    |     30195.55 |     29750.97 |      3293.96 |
|              | Row Percent    |       100.00 |        98.94 |         1.06 |
|              | Col Percent    |        50.24 |        50.26 |        48.86 |
|              | Tot Percent    |        50.24 |        49.71 |         0.53 |
|              | SE Row Percent |         0.00 |         0.11 |         0.11 |
|              | SE Col Percent |         0.51 |         0.51 |         4.44 |
|              | SE Tot Percent |         0.51 |         0.50 |         0.06 |
------------------------------------------------------------------------------
| FEMALE       | Sample Size    |         5593 |         5475 |          118 |
|              | Weighted Size  |   2839023.87 |   2807333.62 |     31690.25 |
|              | SE Weighted    |     34030.51 |     34193.28 |      3931.19 |
|              | Row Percent    |       100.00 |        98.88 |         1.12 |
|              | Col Percent    |        49.76 |        49.74 |        51.14 |
|              | Tot Percent    |        49.76 |        49.20 |         0.56 |
|              | SE Row Percent |         0.00 |         0.14 |         0.14 |
|              | SE Col Percent |         0.51 |         0.51 |         4.44 |
|              | SE Tot Percent |         0.51 |         0.51 |         0.07 |
------------------------------------------------------------------------------
In the row labeled "Total", we see the totals collapsed across gender. In the column labeled "Total", we see the totals collapsed across race. You can make multilayered tables by adding more variables to the tables statement (with * in between each variable).
Although we did not have this happen in the above examples, it often happens that you see stars in your output instead of numbers. Don’t worry – this does not mean that there was an error in your SUDAAN code or trouble calculating things. Rather, it means that SUDAAN did not have enough space in the column to print the number (remember that numbers in this kind of output can get to be really big). We have a FAQ on how to change the stars to numbers that will show you how to fix this problem.
Let’s move from basic descriptive statistics to some analyses. We will start with a regression. Remember that the output from the regression (as well as from any other analysis in SUDAAN) is the same as it would be using non-survey data. In other words, the interpretation of the output does not change just because we are using survey data or a special statistical package for the analysis. We will use ae13, which is the number of drinks on the days on which one drinks alcohol, as the dependent variable, and ae14, number of times having five or more drinks in past month, as the independent variable. We do not claim that this model is sensible or that we are testing any specific hypothesis. Rather, we just selected two continuous variables for this example.
proc regress data=chis filetype=sas design = jackknife;
  weight rakedw0;
  jackwgts rakedw1--rakedw80 / adjjack=1;
  model ae13 = ae14;
run;
Number of observations read        :  55428    Weighted count: 23847415
Observations used in the analysis  :  32538    Weighted count: 13783845
Denominator degrees of freedom     :     80

Maximum number of estimable parameters for the model is 2
Weighted mean response is 2.188590

Multiple R-Square for the dependent variable AE13: 0.241897

Variance Estimation Method: Replicate Weight Jackknife
Working Correlations: Independent
Link Function: Identity
Response variable AE13: Number of drinks on the days drinking alcohol

----------------------------------------------------------------------
Independent                                                  P-value
Variables and             Beta                               T-Test
Effects                   Coeff.     SE Beta    T-Test B=0   B=0
----------------------------------------------------------------------
Intercept                   1.88        0.01        152.15   0.0000
Number of times having 5 or more drinks in past month
                            0.34        0.01         25.47   0.0000
----------------------------------------------------------------------

-------------------------------------------------------
Contrast                   Degrees
                           of                   P-value
                           Freedom     Wald F   Wald F
-------------------------------------------------------
OVERALL MODEL                    2   12818.28   0.0000
MODEL MINUS INTERCEPT            1     648.71   0.0000
INTERCEPT                        1   23150.59   0.0000
AE14                             1     648.71   0.0000
-------------------------------------------------------
You will notice from the first two lines of the output that there were many (thousands) more observations read than were used in the analysis. This is because of missing data. We are also given the weighted mean of the dependent variable and the multiple R-squared. Next we see the coefficients and significance tests. The degrees of freedom are given in the last table. The t-tests shown in the first table are equivalent to the Wald F tests shown in the second table (within rounding error).
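The equivalence mentioned above holds because, for a test with one degree of freedom, the Wald F statistic is the square of the t statistic. A small sketch that checks this with the printed values (the variable names are only illustrative):

```sas
/* t statistics copied from the coefficient table above. */
data _null_;
  t_intercept = 152.15;
  t_ae14      = 25.47;
  f_intercept = t_intercept**2;  /* compare with the Wald F of 23150.59 */
  f_ae14      = t_ae14**2;       /* compare with the Wald F of 648.71   */
  put f_intercept= f_ae14=;      /* writes both squares to the SAS log  */
run;
```

The squares come to roughly 23149.62 and 648.72. The small differences from the printed Wald F values arise because the t statistics are shown rounded to two decimal places.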
You can use categorical variables in your regression by using the subgroup and levels statements.
proc regress data=chis filetype=sas design = jackknife;
  weight rakedw0;
  jackwgts rakedw1--rakedw80 / adjjack=1;
  model ae13 = ae14 srsex;
  subgroup srsex;
  levels 2;
run;
Number of observations read        :  55428    Weighted count: 23847415
Observations used in the analysis  :  32538    Weighted count: 13783845
Denominator degrees of freedom     :     80

Maximum number of estimable parameters for the model is 3
Weighted mean response is 2.188590

Multiple R-Square for the dependent variable AE13: 0.259603

Variance Estimation Method: Replicate Weight Jackknife
Working Correlations: Independent
Link Function: Identity
Response variable AE13: Number of drinks on the days drinking alcohol

----------------------------------------------------------------------
Independent                                                  P-value
Variables and             Beta                               T-Test
Effects                   Coeff.     SE Beta    T-Test B=0   B=0
----------------------------------------------------------------------
Intercept                   1.61        0.01        116.51   0.0000
Number of times having 5 or more drinks in past month
                            0.32        0.01         24.90   0.0000
Self-reported gender
  MALE                      0.52        0.03         19.77   0.0000
  FEMALE                    0.00        0.00           .       .
----------------------------------------------------------------------

-------------------------------------------------------
Contrast                   Degrees
                           of                   P-value
                           Freedom     Wald F   Wald F
-------------------------------------------------------
OVERALL MODEL                    3   10053.93   0.0000
MODEL MINUS INTERCEPT            2     528.93   0.0000
INTERCEPT                        .        .       .
AE14                             1     619.78   0.0000
SRSEX                            1     390.70   0.0000
-------------------------------------------------------
You can change the reference level of the categorical variable by using the reflevel statement, as shown below. All you need to do is list the variable and indicate which value should be used as the reference level. You can specify more than one variable on this statement if you need to set the reference values for more than one variable. Please see our FAQ on using categorical variables in regression analyses for more details and examples using only some of the categories of a variable.
proc regress data=chis filetype=sas design = jackknife;
  weight rakedw0;
  jackwgts rakedw1--rakedw80 / adjjack=1;
  model ae13 = ae14 srsex;
  subgroup srsex;
  reflevel srsex = 1;
  levels 2;
run;
Number of observations read        :  55428    Weighted count: 23847415
Observations used in the analysis  :  32538    Weighted count: 13783845
Denominator degrees of freedom     :     80

Maximum number of estimable parameters for the model is 3
Weighted mean response is 2.188590

Multiple R-Square for the dependent variable AE13: 0.259603

Variance Estimation Method: Replicate Weight Jackknife
Working Correlations: Independent
Link Function: Identity
Response variable AE13: Number of drinks on the days drinking alcohol

----------------------------------------------------------------------
Independent                                                  P-value
Variables and             Beta                               T-Test
Effects                   Coeff.     SE Beta    T-Test B=0   B=0
----------------------------------------------------------------------
Intercept                   2.13        0.02        101.59   0.0000
Number of times having 5 or more drinks in past month
                            0.32        0.01         24.90   0.0000
Self-reported gender
  MALE                      0.00        0.00           .       .
  FEMALE                   -0.52        0.03        -19.77   0.0000
----------------------------------------------------------------------

-------------------------------------------------------
Contrast                   Degrees
                           of                   P-value
                           Freedom     Wald F   Wald F
-------------------------------------------------------
OVERALL MODEL                    3   10053.93   0.0000
MODEL MINUS INTERCEPT            2     528.93   0.0000
INTERCEPT                        .        .       .
AE14                             1     619.78   0.0000
SRSEX                            1     390.70   0.0000
-------------------------------------------------------
Perhaps now is a good time to talk about subpopulations. Many times researchers are interested only in a certain subpopulation, or group, of respondents. For example, you may be interested only in females, or only in people who call themselves white, or only in white females. In order to limit your analysis to just these folks, you might be tempted to use a SAS data step with a subsetting if statement and create a smaller data set with just the individuals of interest. DO NOT DO THIS!!!!! Instead, use the subpopn statement with the intact (complete, whole) data set. We have a FAQ on how to use the subpopn statement where we list references documenting the concerns with incorrectly subsetting your data set and the problems that this can cause. Happily, using the subpopn statement is really easy, and it requires much less effort than creating a new data set. Let’s suppose that we wanted to run our first regression, but only with female respondents. First, let’s be very clear about how the data are coded. We will use two proc freqs to do this. In the first one, we see the labels for gender, and in the second one we see the numbers with which the levels are actually coded.
proc freq data = chis;
  tables srsex;
run;

The FREQ Procedure

                                    Cumulative    Cumulative
SRSEX      Frequency     Percent     Frequency      Percent
-----------------------------------------------------------
MALE           23002       41.50         23002        41.50
FEMALE         32426       58.50         55428       100.00

proc freq data = chis;
  tables srsex;
  format srsex;
run;

The FREQ Procedure

                                    Cumulative    Cumulative
SRSEX      Frequency     Percent     Frequency      Percent
-----------------------------------------------------------
1              23002       41.50         23002        41.50
2              32426       58.50         55428       100.00
Now that we are certain that the females are coded as 2, we can use that value on our subpopn statement. Also note the line in the output that begins with "For Subpopulation:", which tells you for what subpopulation the analysis was done.
proc regress data=chis filetype=sas design = jackknife;
  weight rakedw0;
  jackwgts rakedw1--rakedw80 / adjjack=1;
  model ae13 = ae14;
  subpopn srsex = 2;
run;
Number of observations read        :  55428    Weighted count: 23847415
Observations in subpopulation      :  32426    Weighted count: 12215687
Observations used in the analysis  :  17097    Weighted count:  6104293
Denominator degrees of freedom     :     80

Maximum number of estimable parameters for the model is 2
Weighted mean response is 1.720202

Multiple R-Square for the dependent variable AE13: 0.162257

Variance Estimation Method: Replicate Weight Jackknife
Working Correlations: Independent
Link Function: Identity
Response variable AE13: AE13
For Subpopulation: SRSEX = 2

----------------------------------------------------------------------
Independent                                                  P-value
Variables and             Beta                               T-Test
Effects                   Coeff.     SE Beta    T-Test B=0   B=0
----------------------------------------------------------------------
Intercept                   1.59        0.01        109.96   0.0000
AE14                        0.37        0.03         13.80   0.0000
----------------------------------------------------------------------

-------------------------------------------------------
Contrast                   Degrees
                           of                   P-value
                           Freedom     Wald F   Wald F
-------------------------------------------------------
OVERALL MODEL                    2    7841.08   0.0000
MODEL MINUS INTERCEPT            1     190.51   0.0000
INTERCEPT                        1   12091.73   0.0000
AE14                             1     190.51   0.0000
-------------------------------------------------------
While we will not cover interactions here, you can see our FAQ on how to create interaction terms in SUDAAN. Of course, you can always create the terms that you need in a SAS data step and then use them in your SUDAAN code.
Finally, let’s try a logistic regression. Because we are using SAS-callable SUDAAN, we need to use an alias for proc logistic (so that SAS knows to call the SUDAAN version of the command instead of the SAS version). The alias is proc rlogist. Remember that the dependent variable in proc rlogist must be coded 0/1, but two-level categorical independent variables cannot be coded 0/1 (use 1/2 instead). We are going to use ae9 as our dependent variable, which indicates if the respondent has taken a vitamin or dietary supplement in the past month. We will use proc freq to see how this variable is coded. From the first output, we see that the values are labeled "yes" and "no". In the second proc freq, we will use the format statement to suppress these labels, showing us that the variable is coded 1/2.
proc freq data = chis;
  tables ae9;
run;

The FREQ Procedure

                                  Cumulative    Cumulative
AE9      Frequency     Percent     Frequency      Percent
---------------------------------------------------------
YES          34110       61.59         34110        61.59
NO           21271       38.41         55381       100.00

Frequency Missing = 47

proc freq data = chis;
  tables ae9;
  format ae9;
run;

The FREQ Procedure

                                  Cumulative    Cumulative
AE9      Frequency     Percent     Frequency      Percent
---------------------------------------------------------
1            34110       61.59         34110        61.59
2            21271       38.41         55381       100.00

Frequency Missing = 47
Now that we are certain how the data are coded, we can write a little data step to change the coding to 0/1. Note that subtracting 1 maps YES (coded 1) to 0 and NO (coded 2) to 1. After that, we are ready to run the logistic regression.
data chis1;
  set chis;
  ae91 = ae9 - 1;
run;

proc rlogist data=chis1 filetype=sas design = jackknife;
  weight rakedw0;
  jackwgts rakedw1--rakedw80 / adjjack=1;
  model ae91 = ae14;
run;
Number of zero responses     : 21202
Number of non-zero responses : 11843

Independence parameters have converged in 5 iterations

Number of observations read        :  55428    Weighted count: 23847415
Observations used in the analysis  :  33045    Weighted count: 13995933
Denominator degrees of freedom     :     80

Maximum number of estimable parameters for the model is 2

Sample and Population Counts for Response Variable AE91
  0: Sample Count   21202    Population Count  8418183
  1: Sample Count   11843    Population Count  5577750

R-Square for dependent variable AE91 (Cox & Snell, 1989): 0.003526

-2 * Normalized Log-Likelihood with Intercepts Only : 44439.56
-2 * Normalized Log-Likelihood Full Model           : 44322.82
Approximate Chi-Square (-2 * Log-L Ratio)           :   116.73
Degrees of Freedom                                  :        1

Note: The approximate Chi-Square is not adjusted for clustering.
      Refer to hypothesis test table for adjusted test.

Variance Estimation Method: Replicate Weight Jackknife
Working Correlations: Independent
Link Function: Logit
Response variable AE91: AE91

----------------------------------------------------------------------
Independent                                                  P-value
Variables and             Beta                               T-Test
Effects                   Coeff.     SE Beta    T-Test B=0   B=0
----------------------------------------------------------------------
Intercept                  -0.45        0.02        -29.76   0.0000
AE14                        0.04        0.01          7.07   0.0000
----------------------------------------------------------------------

Variance Estimation Method: Replicate Weight Jackknife
Working Correlations: Independent
Link Function: Logit
Response variable AE91: AE91

-------------------------------------------------------
Contrast                   Degrees
                           of                   P-value
                           Freedom     Wald F   Wald F
-------------------------------------------------------
OVERALL MODEL                    2     442.74   0.0000
MODEL MINUS INTERCEPT            1      49.97   0.0000
INTERCEPT                        1     885.38   0.0000
AE14                             1      49.97   0.0000
-------------------------------------------------------

-----------------------------------------------------------
Independent
Variables and                       Lower 95%    Upper 95%
Effects               Odds Ratio    Limit OR     Limit OR
-----------------------------------------------------------
Intercept                   0.64         0.62         0.66
AE14                        1.04         1.03         1.06
-----------------------------------------------------------
As we can see from the output, we are given both the coefficients and the odds ratios by default. We also get all of the standard output that you expect when you run a logistic regression.
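One thing to keep in mind when reading these results is the direction of the coding. Because ae91 = ae9 - 1 turns YES into 0 and NO into 1, and assuming that proc rlogist models the probability that the dependent variable equals 1, the odds ratios above describe the odds of answering NO. If you would rather model YES, a recode along the following lines could be used instead. This is only a sketch, and the variable name ae9yes is hypothetical:

```sas
data chis1;
  set chis;
  /* ae9 is coded 1 = YES, 2 = NO (see the proc freq output above). */
  if ae9 = 1 then ae9yes = 1;       /* took a vitamin or supplement */
  else if ae9 = 2 then ae9yes = 0;  /* did not */
  /* Missing values of ae9 leave ae9yes missing. */
run;
```

With this coding, ae9yes would take the place of ae91 on the model statement.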
Lastly, let’s talk about missing values in data sets. Below is some SAS code that we have written to change all of the missing-value codes in the numerical variables in the data set into missing values that SUDAAN can understand. Remember that when we got the CHIS data set, missing values were coded as -9, -8, etc., which SUDAAN (and SAS) interpret as valid values. These need to be changed in some way so that the program understands that these are really missing values. For more information on missing values in SAS (and an explanation of .a, .b, etc.), please see our FAQ on how to code missing data in SAS.
data chis;
  /* Change this path to the location of your copy of the CHIS file. */
  set "D:CHIS DataCHIS2001_PUFA2_082802";
  /* Put every numeric variable in the data set into one array. */
  array allnum(*) _numeric_;
  /* Convert each negative missing-value code to a SAS special missing value. */
  do i = 1 to dim(allnum);
    if allnum(i) = -1 then allnum(i) = .a;
    else if allnum(i) = -2 then allnum(i) = .b;
    else if allnum(i) = -3 then allnum(i) = .c;
    else if allnum(i) = -4 then allnum(i) = .d;
    else if allnum(i) = -5 then allnum(i) = .e;
    else if allnum(i) = -6 then allnum(i) = .f;
    else if allnum(i) = -7 then allnum(i) = .g;
    else if allnum(i) = -8 then allnum(i) = .h;
    else if allnum(i) = -9 then allnum(i) = .i;
    else if allnum(i) = -10 then allnum(i) = .j;
    else if allnum(i) = -11 then allnum(i) = .k;
  end;
  drop i;
run;
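After running a recode like the one above, it may be worth confirming that none of the codes survived. One simple check, sketched here with standard SAS, is to look at the minimum of every numeric variable:

```sas
/* After the recode, no numeric variable should still hold the codes -1 to -11. */
proc means data = chis n nmiss min max;
run;
```

A minimum equal to one of the codes -1 through -11 would suggest that a value was not converted. The nmiss column shows how many values are now treated as missing for each variable.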