Sample Setups for Commonly Used Survey Data Sets

This page shows the survey setups for common public use data sets in various statistical packages, including SUDAAN, Stata and SAS. If you are using an earlier version of one of these packages, the code provided below may not work. Also, please note that for your particular analysis, different sampling weight and/or replicate weights may be necessary. For data sets that contain multiple sampling weights and/or replicate weights, the documentation for the survey will indicate when each set of weights should be used. Many of the setups below show the use of different weights with the same data set. Pay special attention to this issue when merging data sets. This page is in no way intended to be a substitute for reading the documentation for the data set.

If you would like more information on the elements of survey designs, including sampling weights, PSUs, stratification and replicate weights, please see our page on replicate weights. For more information on data analysis in Stata, please see our seminar on Survey Data Analysis in Stata. For more information on using SUDAAN to analyze survey data, please see our seminar Introduction to SUDAAN.

A note about missing data: Many of the variables in these data sets have special values for missing data, such as 8888 or -9. In most cases, the statistical package (e.g., Stata, SAS, SUDAAN) will not know that these values should be considered missing, and they will be included as legitimate values in any analysis that is run. To convert these values to missing, please see our Stata FAQ if you are using Stata, and our SAS learning module on missing data or our SAS FAQ if you are using either SAS or SUDAAN. Also note that the different programs handle missing data differently when you use more than one variable in a descriptive command. For example in Stata, svy: mean x y may give you different results than if you used two commands, svy: mean x and svy: mean y, if x and y have different patterns of missing data. You can (usually) quickly tell if listwise deletion is being used by the number of observations being used in the analysis.

A note about non-positive probability weights or replicate weights: The different programs handle non-positive (i.e., zero) weights differently. Stata can use cases with non-positive sampling weights by specifying iweight instead of pweight; hence the total number of cases read is the total number of cases used. As a consequence, the number of raw cases used in each category in the Stata output is different from that shown by SUDAAN or SAS. The top of the SAS output indicates the total number of cases in the data file, as well as the number of cases with a non-positive probability weight and the number of cases used. The raw number of cases matches that given by SUDAAN. SUDAAN does not count these cases as cases read in and gives a note at the top of the output. The cases with non-positives weight are not included in the raw frequency of cases for each category shown in the first part o the output. However, in all cases, the percent of weighted cases for each category is the same for all packages.

A note about output: We have included the output from Stata and SAS but have omitted the output from SUDAAN to save space.

This page contains the setups for the following data sets:

ACS Add Health CHIS CPS GSS LA FANS NHANES Continuous NHANES III NCS SIPP US Census 2000

The output from the Stata and SAS commands is shown; the output from SUDAAN has been omitted to save space.

ACS (American Community Survey)

The American Community Survey is, among other things, the replacement for the long form of the US Census. You can access one-year, three-year or five-year PUMS datasets from the ACS website. The ACS User’s Guide can be found here. The datasets (from 2005 onward) are released with successive differences replicate weights as the method of variance estimation. The documentation for this method can be found here. For more information about the use of the replicate weights, please see http://usa.ipums.org/usa/repwt.shtml . Note that most datasets are released with both person-level and household-level weights (both sampling weights and replicate weights).

This example uses the single year 2010 PUMS dataset, ss10hak. The weights used are household-level weights.

Stata

svyset [pw=wgtp], sdr(wgtp1 - wgtp80) vce(sdr) mse
* If negative replicate weights are a problem, specify the pweight as an iweight.
svyset [iw=wgtp], sdr(wgtp1 - wgtp80) vce(sdr) mse
svy: mean rmsp
(running mean on estimation sample)

SDR replications (80)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..............................

Survey: Mean estimation           Number of obs   =      3,335
                                  Population size =    307,065
                                  Replications    =         80

--------------------------------------------------------------
             |                SDR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
        rmsp |   5.027004   .0459453      4.936953    5.117055
--------------------------------------------------------------

SAS

proc surveymeans data = acs2010 varmethod = jackknife;
weight wgtp;
repweights wgtp1 -- wgtp80 / jkcoefs = 0.05;
var rmsp;
run;

* If negative replicate weights are a problem, you might want to set them to 0;

data acs2010_nn;
set acs2010;
array temp(*) wgtp1-wgtp80;

do i = 1 to dim(temp);
   if temp(i) < 0 then temp(i)=0;
end;
run;

proc surveymeans data = acs2010_nn varmethod = jackknife;
weight wgtp;
repweights wgtp1 -- wgtp80 / jkcoefs = 0.05;
var rmsp;
run;

The SURVEYMEANS Procedure

                    Data Summary

Number of Observations                          3335
Number of Observations Used                     3071
Number of Obs with Nonpositive Weights           264
Sum of Weights                                307065


       Variance Estimation

Method                   Jackknife
Replicate Weights       ACS2010_NN
Number of Replicates            80


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
RMSP                3071        5.027004        0.045948    4.93556376 5.11844435
---------------------------------------------------------------------------------

SUDAAN

proc descript data = acs2010_nn filetype = sas design = jackknife;
weight wgtp;
jackwgts wgtp1 -- wgtp80 / adjjack = .05;
var rmsp;
setenv colwidth = 19;
setenv decwidth = 6;
run;

AddHealth (National Longitudinal Study of Adolescent Health, 1994-2008)

There are four waves of Add Health data. The data can be downloaded here (University of North Carolina) or here (ICPSR). The User Guides and Documentation can be found here. There is also an excellent discussion of common mistakes to avoid when analyzing these data. Note that weight variables are in a separate dataset (one for each wave of data), so the weight variable needs to be merged with the file containing the analysis variables. Also, some of the data files contain more variables than can be read using Stata I/C (Intercooled Stata). You can use Stata S/E, Stata M/P or SAS to reduce the number of variables if you want to do your analysis in Stata I/C.

CHIS (California Health Interview Survey)

Please note that you need to register to access the CHIS data. The data and documentation can be obtained from the CHIS web site. The CHIS methodology documentation can be found here. CHIS data are released with a sampling weight and jackknife replicate weights. The adjustment value is 1.

The 2009 adult dataset is used in the example below.

Stata

svyset [pw = rakedw0], jkrw(rakedw1 - rakedw80, multiplier(1)) vce(jack) mse

      pweight: rakedw0
          VCE: jackknife
          MSE: on
    jkrweight: rakedw1 rakedw2 rakedw3 rakedw4 rakedw5 rakedw6 rakedw7 rakedw8 rakedw9 rakedw10 rakedw11 rakedw12 rakedw13
               rakedw14 rakedw15 rakedw16 rakedw17 rakedw18 rakedw19 rakedw20 rakedw21 rakedw22 rakedw23 rakedw24 rakedw25
               rakedw26 rakedw27 rakedw28 rakedw29 rakedw30 rakedw31 rakedw32 rakedw33 rakedw34 rakedw35 rakedw36 rakedw37
               rakedw38 rakedw39 rakedw40 rakedw41 rakedw42 rakedw43 rakedw44 rakedw45 rakedw46 rakedw47 rakedw48 rakedw49
               rakedw50 rakedw51 rakedw52 rakedw53 rakedw54 rakedw55 rakedw56 rakedw57 rakedw58 rakedw59 rakedw60 rakedw61
               rakedw62 rakedw63 rakedw64 rakedw65 rakedw66 rakedw67 rakedw68 rakedw69 rakedw70 rakedw71 rakedw72 rakedw73
               rakedw74 rakedw75 rakedw76 rakedw77 rakedw78 rakedw79 rakedw80
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

svy: mean bmi_p
(running mean on estimation sample)

Jackknife replications (80)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..............................

Survey: Mean estimation

Number of strata =       1        Number of obs    =     47614
                                  Population size  =  27546591
                                  Replications     =        80
                                  Design df        =        79

--------------------------------------------------------------
             |              Jknife *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       bmi_p |   26.77736   .0532953      26.67127    26.88344
--------------------------------------------------------------

SAS

proc surveymeans data = chis2009_adult varmethod = jackknife;
weight rakedw0;
repweight rakedw1 -- rakedw80 / jkcoef = 1;
var bmi_p;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Observations         47614
Sum of Weights              27546591


         Variance Estimation

Method                       Jackknife
Replicate Weights       CHIS2009_ADULT
Number of Replicates                80


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
BMI_P              47614       26.777356        0.053296    26.6712935 26.8834179
---------------------------------------------------------------------------------

SUDAAN

proc descript data = chis2009_adult filetype = sas design = jackknife;
weight rakedw0;
jackwgts rakedw1 -- rakedw80 / adjjack = 1;
var bmi_p;
setenv decwidth = 6;
setenv colwidth = 18;
run;

CPS (Current Population Survey)

The data and documentation can be obtained from either the IPUMS or the CPS website. The CPS datasets are released with successive difference replicate weights. For more information, please see http://cps.ipums.org/cps/repwt.shtml . Please read the documentation very carefully, especially with respect to how the weight variables are stored in the dataset. You may need to divide the sampling weight by 100 and the replicate weights by 1000 before using these weights in your analysis (depending on where you downloaded the data).

The March 2011 supplement is used for this example.

Stata

Note that iweight is specified instead of pweight because some of the replicate weight values are negative. The generate and foreach commands are provided if you need to divide your weights.

gen wtsupp2 = wtsupp/100;
foreach var of varlist repwtp1 - repwtp160 {
   gen n`var' = `var'/10000
}

svyset [iw=wtsupp], sdrweight(repwtp1-repwtp160) vce(sdr)
svy: mean age

(running mean on estimation sample)

SDR replications (160)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100
..................................................   150
..........

Survey: Mean estimation          Number of obs    =     204983
                                 Population size  =  306109661
                                 Replications     =        160

--------------------------------------------------------------
             |                 SDR
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         age |   36.99796   .0077159      36.98284    37.01308
--------------------------------------------------------------

SAS

The data step below shows the division of weights, if it is needed.

data cps_3_2011;
set temp;
wtsupp2 = wtsupp/100;
array Arepwtp(160) repwtp1 - repwtp160;
array Arepwtpn(160) repwtpn1 - repwtpn160;
do x = 1 to 160;
Arepwtpn(x) = Arepwtp(x)/10000;
end;
run;

proc surveymeans data = cps_3_2011 varmethod = jackknife;
weight wtsupp;
repweights repwtp1 -- repwtp160;
var age;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Observations        204983
Sum of Weights             306109661


       Variance Estimation

Method                   Jackknife
Replicate Weights       CPS_3_2011
Number of Replicates           160


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
age               204983       36.997962        0.075575    36.8487086 37.1472144
---------------------------------------------------------------------------------

SUDAAN

proc descript data = cps_3_2011 filetype = sas design = brr;
weight wtsupp;
repwgt repwtp1 -- repwtp160;
var age;
setenv colwidth = 19;
setenv decwidth = 3;
run;

GSS (General Social Survey)

The GSS data and documentation can be found here. There are datasets from 1972 to 2016.

The 2010 data are used for this example. Please note that although the sampling design includes stratification, the stratification variable was not released in the dataset.

NOTE: The difference in estimated population sizes between Stata and SAS has to do with the 996 missing cases on the variable wwwhr.

Stata

svyset sampcode [pw= wtssnr]

      pweight: wtssnr
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: sampcode
        FPC 1: <zero>

svy: mean wwwhr
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       1          Number of obs    =    1048
Number of PSUs   =      79          Population size  = 1084.08
                                    Design df        =      78

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       wwwhr |   9.968178   .5051429      8.962516    10.97384
--------------------------------------------------------------

SAS

proc surveymeans data = gss2010;
weight wtssnr;
cluster sampcode;
var wwwhr;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Clusters                79
Number of Observations          2044
Sum of Weights            2043.99999


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
wwwhr               1048        9.968178        0.505143    8.96251566 10.9738402
---------------------------------------------------------------------------------

SUDAAN

proc sort data = gss2010;
by sampcode;
run;

proc descript data = gss2010 filetype = sas design = wr;
weight wtssnr;
nest sampcode;
var wwwhr;
setenv colwidth = 19;
setenv decwidth = 6;
run;

LA FANS (Los Angeles Family and Neighborhood Survey)

Please note that you need to register to use the L.A. FANS data. The link to the public use L.A. FANS-2 data files is here. Documentation regarding the sampling can be found here.

Stata

svyset [pw=wgtadlt], strata(povcat)

      pweight: wgtadlt
          VCE: linearized
  Single unit: missing
     Strata 1: povcat
         SU 1: <observations>
        FPC 1: <zero>

svy: mean ab5
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       3          Number of obs    =    2595
Number of PSUs   =    2595          Population size  = 3173.95
                                    Design df        =    2592

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         ab5 |   2.551448    .023172      2.506011    2.596886
--------------------------------------------------------------

SAS

proc surveymeans data = lafans1;
weight wgtadlt;
strata povcat;
var ab5;
run;

The SURVEYMEANS Procedure

                    Data Summary

Number of Strata                                   3
Number of Observations                         10195
Number of Observations Used                     3535
Number of Obs with Nonpositive Weights          6660
Sum of Weights                             3535.6271


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
AB5                 2595        2.551448        0.023172    2.50601087 2.59688568
---------------------------------------------------------------------------------

SUDAAN

proc sort data = lafans1;
by povcat;
run;

proc descript data = lafans1 filetype = sas design = wr;
nest povcat;
weight wgtadlt;
var ab5;
setenv colwidth = 18;
setenv decwidth = 6;
run;

NHANES – Continuous (National Health and Nutrition Examination Survey)

The NHANES data and documentation can be found here. The online tutorials for these datasets are very good, and we recommend that you look these materials over before using these datasets. These tutorials also include information about combining datasets from different years. For these examples, we will use the 2009-2010 demographics dataset; the documentation for this particular dataset can be found here. Starting in 1999, the data are released only with masked strata and PSU variables; no replicate weights are provided.

Stata

svyset sdmvpsu [pw=WTINT2YR], strata(sdmvstra)

      pweight: WTINT2YR
          VCE: linearized
  Single unit: missing
     Strata 1: sdmvstra
         SU 1: sdmvpsu
        FPC 1: <zero>

svy: mean ridageyr
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      15       Number of obs    =      10537
Number of PSUs   =      31       Population size  =  301943719
                                 Design df        =         16

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    ridageyr |   36.68331   .5459442      35.52596    37.84066
--------------------------------------------------------------

SAS

proc surveymeans data = demo_f;
cluster sdmvpsu;
strata sdmvstra;
weight wtint2yr;
var ridageyr;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  15
Number of Clusters                31
Number of Observations         10537
Sum of Weights             301943719


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
RIDAGEYR           10537       36.683305        0.545944    35.5259552 37.8406551
---------------------------------------------------------------------------------

SUDAAN

proc sort data = demo_f;
by sdmvstra sdmvpsu;
run;

proc descript data = demo_f filetype = sas design = wr;
weight wtint2yr;
nest sdmvstra sdmvpsu / missunit;
var ridageyr;
run;

NHANES III (National Health and Nutrition Examination Survey Three)

The data and documentation can obtain from the NHANES website. The NHANES III (1988 – 1994) data sets were released with the variables necessary to correct the standard errors of the estimates by either Taylor series linearization or the replicate weight method. To ensure the privacy of the survey respondents, instead of releasing the actual strata and PSU variables, pseudo-strata and pseudo-PSU variables were released. These are used in the same way that the “real” variables would be used. The data sets also contain balanced-repeated replicate weights (brr). The Fay’s adjustment is 1.7 or .3, depending the statistical package that you are using. Please note that these data sets were released with multiple sampling weights and multiple sets of replicate weights. Care must be taken to ensure that the correct weights are being used with each analysis. The choice of weights depends on the particular data set and variables being analyzed. In the examples below, we use the adult data set. Note that before using the pseudo strata and pseudo PSU variables, the data set must be sorted by the pseudo strata and pseudo PSU. (For the data set containing only the data from 1999-2000, replicate weights using JK-1 are included with the data, with an adjustment of .980769 ( = 51/52). In SUDAAN, the statements would be weight wtmec2yr; jackwgts wtmrep01 – wtmrep52 / adjjack = .980769. See guidelines.pdf for details.)

Stata

* with replicate weights
* NOTE:  You need to use the formula Fay=1-1/sqrt(adjfay) to convert the value of Fay's adjustment
         given in the documentation to the form that Stata wants.  You need to use the -vce(brr)- and -mse- options
         to obtain the standard errors given by SUDAAN. 
display 1-(1/sqrt(1.7))
.23303501

svyset [pweight = wtpfqx6], brrweight(wtpqrp1 - wtpqrp52) fay(.23303501) vce(brr) mse

      pweight: wtpfqx6
          VCE: brr
          MSE: on
    brrweight: wtpqrp1 wtpqrp2 wtpqrp3 wtpqrp4 wtpqrp5 wtpqrp6 wtpqrp7 wtpqrp8 wtpqrp9 wtpqrp10 wtpqrp11 wtpqrp12 wtpqrp13 wtpqrp14
               wtpqrp15 wtpqrp16 wtpqrp17 wtpqrp18 wtpqrp19 wtpqrp20 wtpqrp21 wtpqrp22 wtpqrp23 wtpqrp24 wtpqrp25 wtpqrp26 wtpqrp27
               wtpqrp28 wtpqrp29 wtpqrp30 wtpqrp31 wtpqrp32 wtpqrp33 wtpqrp34 wtpqrp35 wtpqrp36 wtpqrp37 wtpqrp38 wtpqrp39 wtpqrp40
               wtpqrp41 wtpqrp42 wtpqrp43 wtpqrp44 wtpqrp45 wtpqrp46 wtpqrp47 wtpqrp48 wtpqrp49 wtpqrp50 wtpqrp51 wtpqrp52
          fay: .23303501
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

svy: mean haznok5r
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..

Survey: Mean estimation             Number of obs    =   20014
                                    Population size  = 1.9e+08
                                    Replications     =      52
                                    Design df        =      51

--------------------------------------------------------------
             |                BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    haznok5r |   6.851117   .1024657      6.645408    7.056825
--------------------------------------------------------------

* with pseudo-strata and pseudo-PSUs;
svyset sdppsu6 [pweight = wtpfqx6], strata(sdpstra6)

      pweight: wtpfqx6
          VCE: linearized
     Strata 1: sdpstra6
         SU 1: sdppsu6
        FPC 1: <zero>

svy :  mean haznok5r
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      49          Number of obs    =   20014
Number of PSUs   =      98          Population size  = 1.9e+08
                                    Design df        =      49

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    haznok5r |   6.851117   .1237399      6.602452    7.099781
--------------------------------------------------------------

SAS

* with pseudo-strata and pseudo-PSUs;
proc surveymeans data = adult1;
weight wtpfqx6;
strata sdpstra6;
cluster sdppsu6;
var HAZNOK5R;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  49
Number of Clusters                98
Number of Observations         20050
Sum of Weights             187647206


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
HAZNOK5R           20014        6.851117        0.123740    6.60245228 7.09978141
---------------------------------------------------------------------------------

* with brr replicate weights;
proc surveymeans data = adult1 varmethod = brr (fay = .23303501);
weight wtpfqx6;
repweights WTPQRP1 - WTPQRP52;
var HAZNOK5R;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Observations         20050
Sum of Weights             187647206


       Variance Estimation

Method                         BRR
Replicate Weights           ADULT1
Number of Replicates            52
Fay Coefficient         0.23303501


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
HAZNOK5R           20014        6.851117        0.102466    6.64550449 7.05672921
---------------------------------------------------------------------------------

SUDAAN

* with brr replicate weights;
proc descript data = adult1 filetype = sas design=brr;
repwgt WTPQRP1 - WTPQRP52 / adjfay = 1.7;
weight WTPFQX6 ;
var HAZNOK5R;
setenv colwidth = 19;
setenv decwidth = 7;
print nsum wsum mean semean  / nohead;
run;

* with pseudo-strata and pseudo-PSUs;
proc sort data = adult1;
by sdpstra6 sdppsu6;
run;

proc descript data = adult1 filetype = sas design = wr;
nest sdpstra6 sdppsu6 / missunit;
weight WTPFQX6 ;
var HAZNOK5R;
setenv colwidth = 19;
setenv decwidth = 7;
print nsum wsum mean semean / nohead;
run;

National Comorbidity Survey

The NCS data and documentation can be obtained from the NCS website. (NOTE: These examples are taken from the DS2: NCS Diagnosis/Demographic Data)

Stata

svyset secu [pweight = p1fwt], strata(str)

      pweight: p1fwt
          VCE: linearized
     Strata 1: str
         SU 1: secu
        FPC 1: <zero>

svy: mean deplt1
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      42          Number of obs    =    8098
Number of PSUs   =      84          Population size  =    8098
                                    Design df        =      42

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      deplt1 |   .1706523   .0067263      .1570781    .1842266
--------------------------------------------------------------

SAS

A SAS SUGI 27 paper with examples from this data set can be found here .

proc surveymeans data = ncs2;
  strata str;
  cluster secu;
  weight p1fwt;
  var deplt1;
run;

The SURVEYMEANS Procedure

            Data Summary

Number of Strata                  42
Number of Clusters                84
Number of Observations          8098
Sum of Weights             8097.9966


                                   Statistics

                                               Std Error
Variable               N            Mean         of Mean       95% CL for Mean
---------------------------------------------------------------------------------
DEPLT1              8098        0.170652        0.006726    0.15707807 0.18422659
---------------------------------------------------------------------------------

SUDAAN

proc sort data = ncs2;
by str secu;
run;

proc descript data = ncs2 filetype = sas design = wr;
weight p1fwt;
nest str secu;
var deplt1;
setenv colwidth = 12;
setenv decwidth = 6;
print nsum wsum mean semean lowmean upmean;
run;

SIPP (Survey of Income and Program Participation)

The data and file setups can be downloaded from http://www.nber.org/data/sipp.html . A generic example instead of one using real data is shown below.

Stata

svyset [pweight=_yourwgt], brrweight(RepWt_1-RepWt_n) fay(.5) vce(brr) mse

SAS

proc surveymeans data = sipp_data varmethod = brr fay = (.5);
weight _yourwgt;
repweights RepWt_1-RepWt_n;
var _yourvar;
run;

SUDAAN

proc descript data = sipp_data filetype = sas design=brr;
repwgt RepWt_1-RepWt_n / adjfay = 4;
weight _yourwgt;
var _yourvar;
setenv colwidth = 19;
setenv decwidth = 7;
print nsum wsum mean semean  / nohead;
run;

US Census 2000

Census data can be obtained from the Census website. The documentation can be found here . Chapter 5 describes the sampling used, and chapter 4 describes the calculations necessary to obtain the correct standard errors (pages 4-3 to 4-15).

The 2000 US Census was released with person and household weights to weight the sample (either the 1% or the 5% PUMS) back to the national totals. In our examples, we will use the person weights with person level variables. The data are clustered within household; every person within a selected household is included in the sample. For both institutional and non-institutional group quarters, a pseudo household record number was assigned (see pages 2-3 and 3-1 of the documentation). Although it is clearly stated in the documentation that the sample data set was constructed using stratified sampling, the stratification variable was not released with the data set. Furthermore, some of the variables used in the stratification were also not released, so that the stratification variable cannot be reconstructed by the user of the data set. Hence, in our example setup, we will ignore the stratification. Please see chapter 4 of the documentation for instructions on how to obtain correct standard errors. In our examples, we use the 5% PUMS data for California.

Stata

Note that unless you limit the number of variables, you need to use Stata S/E or Stata M/P.

svyset serialno [pweight = pweight]

svy: tab carpool, count se cellwidth(10) format(%15.2g)
(running tabulate on estimation sample)

Number of strata   =         1                  Number of obs      =   1690642
Number of PSUs     =    616115                  Population size    =  33884660
                                                Design df          =    616114

----------------------------------
vehicle   |
occupancy |      count          se
----------+-----------------------
 not in u |   21347148       29827
 drove al |   10418251       15947
 2 people |    1572572        7244
 3 people |     327968        3364
 4 people |     120283        2240
 5 or 6 p |      56246        1505
 7 or mor |      42192        1283
          | 
    Total |   33884660            
----------------------------------
  Key:  count       =  weighted counts
        se          =  linearized standard errors of weighted counts

SAS

proc surveyfreq data = census2000;
weight pweight;
cluster serialno;
tables carpool;
run;

The SURVEYFREQ Procedure

                    Data Summary

Number of Clusters                            616115
Number of Observations                       1690642
Number of Observations Used                  1690362
Number of Obs with Nonpositive Weights           280
Sum of Weights                              33884660


                             vehicle occupancy

                           Weighted    Std Dev of               Std Err of
CARPOOL     Frequency     Frequency      Wgt Freq    Percent       Percent
--------------------------------------------------------------------------
      0       1073728      21347148         29827    62.9994        0.0452
      1        511854      10418251         15947    30.7462        0.0440
      2         77629       1572572          7244     4.6410        0.0207
      3         16250        327968          3364     0.9679        0.0098
      4          5941        120283          2240     0.3550        0.0066
      5          2851         56246          1505     0.1660        0.0044
      6          2109         42192          1283     0.1245        0.0038

  Total       1690362      33884660         35601    100.000
--------------------------------------------------------------------------

SUDAAN

proc sort data = census2000;
by serialno;
run;

proc crosstab data = census2000 filetype = sas design = wr;
nest _one_ serialno;
weight pweight;
class carpool;setenv colwidth = 14;
print nsum wsum sewgt rowper serow;
run;