Survey Data Analysis in StataSelecting the sample

Simple random sample in Stata

In this example, we are taking a simple random sampling of schools. After loading the data set into Stata, we will use the count command to see how many cases we have in the data file. Next, we will set the seed so that the results are replicable. If you do not set a seed and you run the code a second time, you will get slightly different results because a different sample will be used. To draw the sample, we will use the sample command. A number without the count option indicates the percentage to be sampled from the data in memory. Finally, we will use the count command again to see how many cases (or elements) have been selected into our sample.

clear
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
count

 6194

set seed 1003002849
sample 5
count

 310

Now we need to create the probability weights. Our sampling fraction is 310/6194, and the inverse of this is the probability weight (which Stata calls a “pweight”. (See Levy and Lemeshow, page 49).

gen pw = 6194/310

Because we are sampling a rather large percentage of our population, we need to set the fpc. Stata will calculate the actual fpc for us; we just need to specify the population total.

gen fpc = 6194

Next, we will recode awards to be a 0/1 variable.

recode awards (1=0) (2=1)

We are now ready to use the svyset command to tell Stata about the features of our survey sampling plan. We will use the svydes command to ensure that everything is OK.

svyset [pweight=pw], fpc(fpc)
svydes

pweight:  pw
Strata:   <one>
PSU:      <observations>
FPC:      fpc
                                      #Obs per PSU
 Strata                       ----------------------------
  <one>     #PSUs     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1

Stratified random sampling in Stata

The difference between the example above and the example below is that stratification has been added.

We will create a stratification variable to be used in this example. We will create two strata based upon schools’ average api99 score. Schools with an api99 score equal to or below the mean api99 score will be in strata 1 and those with api99 scores above the mean will be in strata 2. You will in the results below that the stratification makes some of the estimates more efficient, while other estimates are not helped by the stratification. The difference has to do with the relationship between the stratification variable and the variable being estimated. If there is a reasonable relationship, then stratification is helpful; if there is not a relationship, stratification does not help. (Even if there is not a relationship between the stratification variable and the variable being estimated, the stratification usually will not make the estimate less efficient than SRS.)

To create the strata variable, which we will call strat, we use the generate command (gen for short) and set strat equal to 1. We will then use the replace command to replace the value of strat with 2 if the value of api99 is greater than 631, which is approximately the mean. Under normal circumstances, you would not make up strata as we are here. Instead, the strata would be naturally occurring in the population, such as gender, race, categories of age or income, etc. We will use the prefix by before the count command and before the sample command. When used before the sample command, by tells Stata to select the sample independently from each strata.

summ api99

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       api99 |      6194     631.913    132.4346        302        966

gen strat = 1
replace strat = 2 if api99 > 631 
(3095 real changes made)

sort strat
by strat: count
_______________________________________________________________________________
-> strat = 1
 3099
_______________________________________________________________________________
-> strat = 2
 3095

set seed 1003002849
by strat: sample 10
(5574 observations deleted)
sort strat
by strat: count
_______________________________________________________________________________
-> strat = 1
  310
_______________________________________________________________________________
-> strat = 2
  310

Now that we know how many elements are in each strata, we can compute the probability weights. We will use the same formula as before: N/n, where N is the total number in the population (in this case, the total number of elements in the population in that particular strata), and n is the number of elements selected into the sample in that strata. (See Levy and Lemeshow, page 122)

gen pw = 3099/310 if strat == 1
replace pw = 3095/310 if strat == 2

We also use the population totals in each strata to create the fpc variable, which we have again called fpc.

gen fpc = 3099 if strat == 1
replace fpc = 3095 if strat == 2

Now we will recode awards from a 1/2 variable to a 0/1 variable, so that its interpretation in regression analyses is simpler.

recode awards (1=0) (2=1)

The svyset, clear(all) command is not really necessary, but is included to show how it could be used. Finally, we use svyset and check it with svydes.

svyset, clear(all)
svyset [pweight = pw], strata(strat) fpc(fpc)
svydes

pweight:  pw
Strata:   strat
PSU:      <observations>
FPC:      fpc
                                      #Obs per PSU
 Strata                       ----------------------------
  strat     #PSUs     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1
       2       310       310         1       1.0         1
--------  --------  --------  --------  --------  --------
       2       620       620         1       1.0         1

Systematic sampling

There are 6194 schools in our sample, and we would like to use systematic sampling to select a sample of size 500. Therefore, k = 6194/500 = 13, meaning that we will select every 13th school. Now we need to randomly select the number from which to start. To do this, we will take the integer part (obtained with the int function) of a random number (obtained with the uniform() function). We will multiply it by 13, because we want 13 to be the upper limit of the numbers generated. We will add 1 to our random number, because the number returned by the uniform() function will range from 0 to 12.9999999. If we add 1 and take the integer part of the number (the part before the decimal point), we will get a random number between 1 and 13. The number randomly selected was 4. Hence, we will begin selecting into our sample every 13th school starting with school number 4. (See Levy and Lemeshow, page 83)

set seed 37
display int(uniform()*13)+1
4

To actually select the sample, we will sort the data by snum (school number), drop the first three schools (because we want to start with school number 4), and then generate a new variable, which we called y, that is the modulus (i.e., the remainder after division) of the school number divided by 13. We drop all of the cases for which y is not equal to 0 and use the count command to determine how many schools are in our sample.

use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear

sort snum
drop if _n < 4
gen newsno = _n - 1
gen y = mod(newsno,13)
drop if y != 0
count

display 6194/13

476.46154

Now we need to create the p-weights and set the fpc.

gen pw = 6194/477
gen fpc = 6194

We will recode and relabel the variable awards.

recode awards (1=0) (2=1)
label define yesno 0 no 1 yes
label values awards yesno

svyset [pweight = pw], fpc(fpc)
svydes

pweight:  pw
Strata:   <one>
PSU:      <observations>
FPC:      fpc
                                      #Obs per PSU
 Strata                       ----------------------------
  <one>     #PSUs     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       477       477         1       1.0         1
--------  --------  --------  --------  --------  --------
       1       477       477         1       1.0         1

One-stage cluster sampling

In our one-stage cluster sample, the districts will be the cluster and the schools will be the elementary or sampling units. We have decided to use simple random sampling to select our clusters. Hence, we randomly select school districts and then select all schools within each selected district.

use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear

Next, we need to create a new data set with just one row for each dnum so that we can do the sampling. We will use the contract command to do this. We will get the total number of districts now for use in the calculation of the probability weight later. Next, we will create a new variable, x, with random values; sort the data on x; set the seed; select the sample; and see how many districts were selected. We resort the data on dnum and save the data set for use in the second part of creating the sample.

contract dnum
count

set seed 1002
sample 25
count

sort dnum
keep dnum
save "D:\oscs.dta", replace

Now that we know which districts have been selected to be in our sample, we need to put that information into the full data set. (Remember that the data set that we just created does not contain the information for each school in the district.) We sort the file on dnum (notice that this is the same variable that we sorted the other data file on), and then merge the two files. We drop all cases that do not match and see that we have 1461 cases selected into our sample.

use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
sort dnum
merge dnum using "D:oscs.dta"
drop if _merge != 3
count

Now we need to create the probability weight and fpc. Remember that the probability weight is based on the number of districts in the population and the number of districts selected into the sample, not the number cases (e.g., schools) in the sample. (See Levy and Lemeshow, page 247)

gen pw = (757/189)
gen fpc = 757

svyset dnum [pweight = pw], fpc(fpc)
svydes

pweight:  pw
Strata:   <one>
PSU:      dnum
FPC:      fpc
                                      #Obs per PSU
 Strata                       ----------------------------
  <one>     #PSUs     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1       189      1463         1       7.7       100
--------  --------  --------  --------  --------  --------
       1       189      1463         1       7.7       100

NOTE: There is a substantial amount of variability from one sample to the next. In some samples, the maximum number of observations per PSU is 552 and the design effects are as high as 140 for some estimates.

Two-stage cluster sampling with stratification

To select this sample, we are going to break up the process into four parts. First we will create the strata; second, we will do the first- and second-stage sampling in strata 1; third, we will repeat the process in strata 2; fourth, we will concatenate the files for strata 1 and strata 2 to create the file working data file.

NOTE: In most cases, you will not have to create the strata yourself. Instead, they will already be defined for you: perhaps you will use variables such as gender and race to create your strata. We show the creation of the strata here because occasionally (such as for teaching purposes) you may have to do this, and there are some tricky issues involved.

NOTE2: Because we are using the same sampling procedure in each strata, we could do the sampling for both strata at the same time, using the by() option on the sample command. However, it is common not to use the same sampling design in each strata. For example, you may over-sample individuals of a particular group because you are interested in obtaining a more precise estimate for that group, or because there are relatively few members in the group. For this reason, we show the sampling individually for each strata. Also, we tried to use code that could be easily adapted to other situations, even if it is not the most parsimonious code possible for our example.

Part 1: Creating the strata

We will create the strata in a manner similar to that used in the previous in the example with stratified random sampling. We will save one file with only the cases for strata 1 and a different file for the cases for strata 2. However, to determine the cutoff point for which districts should be in strata 1 and which should be in strata 2, we will need to use a slightly different procedure than the one we used before. This change is necessary because we are now stratifying the school districts, whereas before we were stratifying schools themselves. The find the appropriate cutpoint, we will get the mean api99 score for each school district, collapse the data file so that there is only one observation per combination of district numbers and means, and then locate the mean.

* determining the cutpoint between the 2 strata
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
egen mean = mean(api99), by(dnum)
contract dnum mean
sum mean
* 650
histogram mean, xline(650) normal xlabel(350(50)950) freq

* creating the data file for strata 1
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
egen mean = mean(api99), by(dnum)
gen strata = 1
replace strata = 2 if mean > 650
drop if strata == 2
save apipops1.dta, replace

* creating the data file for strata 2
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
egen mean = mean(api99), by(dnum)
gen strata = 1
replace strata = 2 if mean > 650
drop if strata == 1
save apipops2.dta, replace

Now we will select our sample in strata 1. We will begin be determining how many cases and clusters we have. To make the process a little simpler, the output from most commands has been included as a comment immediately below the command.

* working in strata 1
use apipops1.dta, clear
count
* 3644 cases
codebook dnum
* 377 clusters
sort dnum
by dnum: gen n = _n
summ n
* 1 to 552

We will select the clusters in the same way that we did for the one-stage cluster sample.

contract dnum
count
* 377
sample 25
count
* 94 school districts
sort dnum
save oscss1.dta, replace
use apipops1.dta, clear
sort dnum
merge dnum using oscss1.dta
drop if _merge != 3
count
* 837 schools 
sort dnum
by dnum: gen n = _n
summ n
* 1 to 100

Now we will select the schools from each district. We have decided to select three schools from each district. If a district has three or fewer schools, we will select into the sample all of the schools. To do this, we will create a new random variable, xx, by dnum, and sort on it. We next create new variables that tell us the number of schools within each district (number), and the total number of schools within each district (N). After dropping the schools with numbers greater than 3, we sort the data on dnum and number. Finally, we generate nn, which is the total number of selected schools in each district.

by dnum: gen xx = uniform()
sort dnum xx
by dnum: gen number = _n
by dnum: gen N = _N
drop if number > 3
count
* 227 schools sampled
sort dnum number
by dnum: gen nn = _N

Now we are ready to create the probability weights and fpc. For a two-stage sample, the formula for the probability weights is f₁*f₂, where f₁ is the inverse of the sampling fraction for level 1 (selecting the clusters) and f₂ is the inverse of the sampling fraction for level 2 (selecting the elements). (See Levy and Lemeshow, page 280)

gen p1 = 377/94
gen p2 = N/nn
gen pwt = p1*p2
gen fpc = 377
save strata1.dta, replace

Now we are ready to select the sample in strata 2. Although we could alter the sampling plan at either level 1 or level 2, or both, we will follow the same procedure that we used in strata 1.

* working in strata 2
use apipops2.dta, clear
count
* 2550 cases
codebook dnum
* 380 clusters
sort dnum
by dnum: gen n = _n
summ n
* 1 to 142

* selecting clusters
contract dnum
count
*380
sample 25
count
*95 school districts
sort dnum
save oscss2.dta, replace
use apipops2.dta, clear
sort dnum
merge dnum using oscss2.dta
drop if _merge != 3
count
* 669 schools 
sort dnum
by dnum: gen n = _n
summ n
* 1 to 72

* selecting schools within districts
by dnum: gen xx = uniform()
sort dnum xx
by dnum: gen number = _n
by dnum: gen N = _N
drop if number > 3
count
* 239 schools sampled
sort dnum number
by dnum: gen nn = _N
* creating the pweights and fpc
gen p1 = 380/95
gen p2 = N/nn
gen pwt = p1*p2
gen fpc = 380
save strata2.dta, replace

At long last, we are ready to concatenate (stack) the data sets (the 2 strata) together. We will also create some variables that we will need for the analyses shown in this section.

append using strata1.dta
count
* 466
gen comp_imp1 = comp_imp - 1
recode awards (1 = 0) ( 2= 1)
gen meals3 = 2
replace meals3 = 1 if meals <= 33
replace meals3 = 3 if meals > 67
save strataboth.dta, replace
svyset dnum [pweight = pwt], fpc(fpc) strata(strata)
svydes

pweight:  pwt
Strata:   strata
PSU:      dnum
FPC:      fpc
                                      #Obs per PSU
 Strata                       ----------------------------
 strata     #PSUs     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1        94       227         1       2.4         3
       2        95       239         1       2.5         3
--------  --------  --------  --------  --------  --------
       2       189       466         1       2.5         3