Simple random sample in Stata
In this example, we are taking a simple random sampling of schools. After loading the data set into Stata, we will use the count command to see how many cases we have in the data file. Next, we will set the seed so that the results are replicable. If you do not set a seed and you run the code a second time, you will get slightly different results because a different sample will be used. To draw the sample, we will use the sample command. A number without the count option indicates the percentage to be sampled from the data in memory. Finally, we will use the count command again to see how many cases (or elements) have been selected into our sample.
clear use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear count 6194
set seed 1003002849 sample 5 count 310
Now we need to create the probability weights. Our sampling fraction is 310/6194, and the inverse of this is the probability weight (which Stata calls a “pweight”. (See Levy and Lemeshow, page 49).
gen pw = 6194/310
Because we are sampling a rather large percentage of our population, we need to set the fpc. Stata will calculate the actual fpc for us; we just need to specify the population total.
gen fpc = 6194
Next, we will recode awards to be a 0/1 variable.
recode awards (1=0) (2=1)
We are now ready to use the svyset command to tell Stata about the features of our survey sampling plan. We will use the svydes command to ensure that everything is OK.
svyset [pweight=pw], fpc(fpc) svydes
pweight: pw Strata: <one> PSU: <observations> FPC: fpc #Obs per PSU Strata ---------------------------- <one> #PSUs #Obs min mean max -------- -------- -------- -------- -------- -------- 1 310 310 1 1.0 1 -------- -------- -------- -------- -------- -------- 1 310 310 1 1.0 1
Stratified random sampling in Stata
The difference between the example above and the example below is that stratification has been added.
We will create a stratification variable to be used in this example. We will create two strata based upon schools’ average api99 score. Schools with an api99 score equal to or below the mean api99 score will be in strata 1 and those with api99 scores above the mean will be in strata 2. You will in the results below that the stratification makes some of the estimates more efficient, while other estimates are not helped by the stratification. The difference has to do with the relationship between the stratification variable and the variable being estimated. If there is a reasonable relationship, then stratification is helpful; if there is not a relationship, stratification does not help. (Even if there is not a relationship between the stratification variable and the variable being estimated, the stratification usually will not make the estimate less efficient than SRS.)
To create the strata variable, which we will call strat, we use the generate command (gen for short) and set strat equal to 1. We will then use the replace command to replace the value of strat with 2 if the value of api99 is greater than 631, which is approximately the mean. Under normal circumstances, you would not make up strata as we are here. Instead, the strata would be naturally occurring in the population, such as gender, race, categories of age or income, etc. We will use the prefix by before the count command and before the sample command. When used before the sample command, by tells Stata to select the sample independently from each strata.
summ api99
Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- api99 | 6194 631.913 132.4346 302 966
gen strat = 1 replace strat = 2 if api99 > 631 (3095 real changes made)
sort strat by strat: count _______________________________________________________________________________ -> strat = 1 3099 _______________________________________________________________________________ -> strat = 2 3095 set seed 1003002849 by strat: sample 10 (5574 observations deleted) sort strat by strat: count _______________________________________________________________________________ -> strat = 1 310 _______________________________________________________________________________ -> strat = 2 310
Now that we know how many elements are in each strata, we can compute the probability weights. We will use the same formula as before: N/n, where N is the total number in the population (in this case, the total number of elements in the population in that particular strata), and n is the number of elements selected into the sample in that strata. (See Levy and Lemeshow, page 122)
gen pw = 3099/310 if strat == 1 replace pw = 3095/310 if strat == 2
We also use the population totals in each strata to create the fpc variable, which we have again called fpc.
gen fpc = 3099 if strat == 1 replace fpc = 3095 if strat == 2
Now we will recode awards from a 1/2 variable to a 0/1 variable, so that its interpretation in regression analyses is simpler.
recode awards (1=0) (2=1)
The svyset, clear(all) command is not really necessary, but is included to show how it could be used. Finally, we use svyset and check it with svydes.
svyset, clear(all) svyset [pweight = pw], strata(strat) fpc(fpc) svydes
pweight: pw Strata: strat PSU: <observations> FPC: fpc #Obs per PSU Strata ---------------------------- strat #PSUs #Obs min mean max -------- -------- -------- -------- -------- -------- 1 310 310 1 1.0 1 2 310 310 1 1.0 1 -------- -------- -------- -------- -------- -------- 2 620 620 1 1.0 1
Systematic sampling
There are 6194 schools in our sample, and we would like to use systematic sampling to select a sample of size 500. Therefore, k = 6194/500 = 13, meaning that we will select every 13th school. Now we need to randomly select the number from which to start. To do this, we will take the integer part (obtained with the int function) of a random number (obtained with the uniform() function). We will multiply it by 13, because we want 13 to be the upper limit of the numbers generated. We will add 1 to our random number, because the number returned by the uniform() function will range from 0 to 12.9999999. If we add 1 and take the integer part of the number (the part before the decimal point), we will get a random number between 1 and 13. The number randomly selected was 4. Hence, we will begin selecting into our sample every 13th school starting with school number 4. (See Levy and Lemeshow, page 83)
set seed 37 display int(uniform()*13)+1 4
To actually select the sample, we will sort the data by snum (school number), drop the first three schools (because we want to start with school number 4), and then generate a new variable, which we called y, that is the modulus (i.e., the remainder after division) of the school number divided by 13. We drop all of the cases for which y is not equal to 0 and use the count command to determine how many schools are in our sample.
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
sort snum drop if _n < 4 gen newsno = _n - 1 gen y = mod(newsno,13) drop if y != 0 count
477
display 6194/13
476.46154
Now we need to create the p-weights and set the fpc.
gen pw = 6194/477 gen fpc = 6194
We will recode and relabel the variable awards.
recode awards (1=0) (2=1) label define yesno 0 no 1 yes label values awards yesno
svyset [pweight = pw], fpc(fpc) svydes
pweight: pw Strata: <one> PSU: <observations> FPC: fpc #Obs per PSU Strata ---------------------------- <one> #PSUs #Obs min mean max -------- -------- -------- -------- -------- -------- 1 477 477 1 1.0 1 -------- -------- -------- -------- -------- -------- 1 477 477 1 1.0 1
One-stage cluster sampling
In our one-stage cluster sample, the districts will be the cluster and the schools will be the elementary or sampling units. We have decided to use simple random sampling to select our clusters. Hence, we randomly select school districts and then select all schools within each selected district.
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
Next, we need to create a new data set with just one row for each dnum so that we can do the sampling. We will use the contract command to do this. We will get the total number of districts now for use in the calculation of the probability weight later. Next, we will create a new variable, x, with random values; sort the data on x; set the seed; select the sample; and see how many districts were selected. We resort the data on dnum and save the data set for use in the second part of creating the sample.
contract dnum count
757
set seed 1002 sample 25 count
189
sort dnum keep dnum save "D:\oscs.dta", replace
Now that we know which districts have been selected to be in our sample, we need to put that information into the full data set. (Remember that the data set that we just created does not contain the information for each school in the district.) We sort the file on dnum (notice that this is the same variable that we sorted the other data file on), and then merge the two files. We drop all cases that do not match and see that we have 1461 cases selected into our sample.
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear sort dnum merge dnum using "D:oscs.dta" drop if _merge != 3 count
1463
Now we need to create the probability weight and fpc. Remember that the probability weight is based on the number of districts in the population and the number of districts selected into the sample, not the number cases (e.g., schools) in the sample. (See Levy and Lemeshow, page 247)
gen pw = (757/189) gen fpc = 757 svyset dnum [pweight = pw], fpc(fpc) svydes
pweight: pw Strata: <one> PSU: dnum FPC: fpc #Obs per PSU Strata ---------------------------- <one> #PSUs #Obs min mean max -------- -------- -------- -------- -------- -------- 1 189 1463 1 7.7 100 -------- -------- -------- -------- -------- -------- 1 189 1463 1 7.7 100
NOTE: There is a substantial amount of variability from one sample to the next. In some samples, the maximum number of observations per PSU is 552 and the design effects are as high as 140 for some estimates.
Two-stage cluster sampling with stratification
To select this sample, we are going to break up the process into four parts. First we will create the strata; second, we will do the first- and second-stage sampling in strata 1; third, we will repeat the process in strata 2; fourth, we will concatenate the files for strata 1 and strata 2 to create the file working data file.
NOTE: In most cases, you will not have to create the strata yourself. Instead, they will already be defined for you: perhaps you will use variables such as gender and race to create your strata. We show the creation of the strata here because occasionally (such as for teaching purposes) you may have to do this, and there are some tricky issues involved.
NOTE2: Because we are using the same sampling procedure in each strata, we could do the sampling for both strata at the same time, using the by() option on the sample command. However, it is common not to use the same sampling design in each strata. For example, you may over-sample individuals of a particular group because you are interested in obtaining a more precise estimate for that group, or because there are relatively few members in the group. For this reason, we show the sampling individually for each strata. Also, we tried to use code that could be easily adapted to other situations, even if it is not the most parsimonious code possible for our example.
Part 1: Creating the strata
We will create the strata in a manner similar to that used in the previous in the example with stratified random sampling. We will save one file with only the cases for strata 1 and a different file for the cases for strata 2. However, to determine the cutoff point for which districts should be in strata 1 and which should be in strata 2, we will need to use a slightly different procedure than the one we used before. This change is necessary because we are now stratifying the school districts, whereas before we were stratifying schools themselves. The find the appropriate cutpoint, we will get the mean api99 score for each school district, collapse the data file so that there is only one observation per combination of district numbers and means, and then locate the mean.
* determining the cutpoint between the 2 strata use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear egen mean = mean(api99), by(dnum) contract dnum mean sum mean * 650 histogram mean, xline(650) normal xlabel(350(50)950) freq
* creating the data file for strata 1 use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear egen mean = mean(api99), by(dnum) gen strata = 1 replace strata = 2 if mean > 650 drop if strata == 2 save apipops1.dta, replace
* creating the data file for strata 2 use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear egen mean = mean(api99), by(dnum) gen strata = 1 replace strata = 2 if mean > 650 drop if strata == 1 save apipops2.dta, replace
Now we will select our sample in strata 1. We will begin be determining how many cases and clusters we have. To make the process a little simpler, the output from most commands has been included as a comment immediately below the command.
* working in strata 1 use apipops1.dta, clear count * 3644 cases codebook dnum * 377 clusters sort dnum by dnum: gen n = _n summ n * 1 to 552
We will select the clusters in the same way that we did for the one-stage cluster sample.
contract dnum count * 377 sample 25 count * 94 school districts sort dnum save oscss1.dta, replace use apipops1.dta, clear sort dnum merge dnum using oscss1.dta drop if _merge != 3 count * 837 schools sort dnum by dnum: gen n = _n summ n * 1 to 100
Now we will select the schools from each district. We have decided to select three schools from each district. If a district has three or fewer schools, we will select into the sample all of the schools. To do this, we will create a new random variable, xx, by dnum, and sort on it. We next create new variables that tell us the number of schools within each district (number), and the total number of schools within each district (N). After dropping the schools with numbers greater than 3, we sort the data on dnum and number. Finally, we generate nn, which is the total number of selected schools in each district.
by dnum: gen xx = uniform() sort dnum xx by dnum: gen number = _n by dnum: gen N = _N drop if number > 3 count * 227 schools sampled sort dnum number by dnum: gen nn = _N
Now we are ready to create the probability weights and fpc. For a two-stage sample, the formula for the probability weights is f1*f2, where f1 is the inverse of the sampling fraction for level 1 (selecting the clusters) and f2 is the inverse of the sampling fraction for level 2 (selecting the elements). (See Levy and Lemeshow, page 280)
gen p1 = 377/94 gen p2 = N/nn gen pwt = p1*p2 gen fpc = 377 save strata1.dta, replace
Now we are ready to select the sample in strata 2. Although we could alter the sampling plan at either level 1 or level 2, or both, we will follow the same procedure that we used in strata 1.
* working in strata 2 use apipops2.dta, clear count * 2550 cases codebook dnum * 380 clusters sort dnum by dnum: gen n = _n summ n * 1 to 142
* selecting clusters contract dnum count *380 sample 25 count *95 school districts sort dnum save oscss2.dta, replace use apipops2.dta, clear sort dnum merge dnum using oscss2.dta drop if _merge != 3 count * 669 schools sort dnum by dnum: gen n = _n summ n * 1 to 72
* selecting schools within districts by dnum: gen xx = uniform() sort dnum xx by dnum: gen number = _n by dnum: gen N = _N drop if number > 3 count * 239 schools sampled sort dnum number by dnum: gen nn = _N * creating the pweights and fpc gen p1 = 380/95 gen p2 = N/nn gen pwt = p1*p2 gen fpc = 380 save strata2.dta, replace
At long last, we are ready to concatenate (stack) the data sets (the 2 strata) together. We will also create some variables that we will need for the analyses shown in this section.
append using strata1.dta count * 466 gen comp_imp1 = comp_imp - 1 recode awards (1 = 0) ( 2= 1) gen meals3 = 2 replace meals3 = 1 if meals <= 33 replace meals3 = 3 if meals > 67 save strataboth.dta, replace svyset dnum [pweight = pwt], fpc(fpc) strata(strata) svydes
pweight: pwt Strata: strata PSU: dnum FPC: fpc #Obs per PSU Strata ---------------------------- strata #PSUs #Obs min mean max -------- -------- -------- -------- -------- -------- 1 94 227 1 2.4 3 2 95 239 1 2.5 3 -------- -------- -------- -------- -------- -------- 2 189 466 1 2.5 3