## Simple random sample in Stata

In this example, we are taking a simple random sampling of schools.
After loading the data set into Stata, we will use the **count** command to
see how many cases we have in the data file. Next, we will set the seed so that the results are replicable.
If you do not set a seed and you run the code a second time, you will get
slightly different results because a different sample will be used. To
draw the sample, we will use the **sample** command. A number without the
**count** option indicates the percentage to be sampled from the data in memory.
Finally, we will use the **count** command again to see how many cases (or
elements) have been selected into our sample.

clear use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear count6194

set seed 1003002849 sample 5 count310

Now we need to create the probability weights. Our sampling fraction is 310/6194, and the inverse of this is the probability weight (which Stata calls a “pweight”. (See Levy and Lemeshow, page 49).

gen pw = 6194/310

Because we are sampling a rather large percentage of our population, we need to set the fpc. Stata will calculate the actual fpc for us; we just need to specify the population total.

gen fpc = 6194

Next, we will recode awards to be a 0/1 variable.

recode awards (1=0) (2=1)

We are now ready to use the **svyset** command to tell Stata about the
features of our survey sampling plan. We will use the **svydes**
command to ensure that everything is OK.

svyset [pweight=pw], fpc(fpc) svydes

pweight: pw Strata: <one> PSU: <observations> FPC: fpc #Obs per PSU Strata ---------------------------- <one> #PSUs #Obs min mean max -------- -------- -------- -------- -------- -------- 1 310 310 1 1.0 1 -------- -------- -------- -------- -------- -------- 1 310 310 1 1.0 1

## Stratified random sampling in Stata

The difference between the example above and the example below is that stratification has been added.

We will create a stratification variable to be used in this example. We
will create two strata based upon schools’ average **api99** score.
Schools with an api99 score equal to or below the mean **api99** score will be in
strata 1 and those with **api99** scores above the mean will be in strata 2.
You will in the results below that the stratification makes some of the
estimates more efficient, while other estimates are not helped by the
stratification. The difference has to do with the relationship between the
stratification variable and the variable being estimated. If there is a
reasonable relationship, then stratification is helpful; if there is not a
relationship, stratification does not help. (Even if there is not a
relationship between the stratification variable and the variable being
estimated, the stratification usually will not make the estimate less efficient
than SRS.)

To create the strata variable, which we will call **strat**, we use the **
generate** command (**gen** for short) and set **strat** equal to 1.
We will then use the **replace** command to replace the value of **strat** with 2
if the value of **api99** is greater than 631, which is approximately the
mean.
Under normal circumstances, you would not make up strata as we are here.
Instead, the strata would be naturally occurring in the population, such as
gender, race, categories of age or income, etc. We will use the prefix **
by** before the **count** command and before the **sample** command.
When used before the sample command, **by** tells Stata to select the sample
independently from each strata.

summ api99

Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- api99 | 6194 631.913 132.4346 302 966

gen strat = 1 replace strat = 2 if api99 > 631(3095 real changes made)

sort strat by strat: count_______________________________________________________________________________ -> strat = 1 3099 _______________________________________________________________________________ -> strat = 2 3095set seed 1003002849 by strat: sample 10(5574 observations deleted)sort strat by strat: count_______________________________________________________________________________ -> strat = 1 310 _______________________________________________________________________________ -> strat = 2 310

Now that we know how many elements are in each strata, we can compute the probability weights. We will use the same formula as before: N/n, where N is the total number in the population (in this case, the total number of elements in the population in that particular strata), and n is the number of elements selected into the sample in that strata. (See Levy and Lemeshow, page 122)

gen pw = 3099/310 if strat == 1 replace pw = 3095/310 if strat == 2

We also use the population totals in each strata to create the fpc variable,
which we have again called **fpc**.

gen fpc = 3099 if strat == 1 replace fpc = 3095 if strat == 2

Now we will recode **awards** from a 1/2 variable to a 0/1 variable, so
that its interpretation in regression analyses is simpler.

recode awards (1=0) (2=1)

The **svyset, clear(all)** command is not really necessary, but is included to show
how it could be used. Finally, we use **svyset** and check it with **
svydes**.

svyset, clear(all) svyset [pweight = pw], strata(strat) fpc(fpc) svydes

pweight: pw Strata: strat PSU: <observations> FPC: fpc #Obs per PSU Strata ---------------------------- strat #PSUs #Obs min mean max -------- -------- -------- -------- -------- -------- 1 310 310 1 1.0 1 2 310 310 1 1.0 1 -------- -------- -------- -------- -------- -------- 2 620 620 1 1.0 1

## Systematic sampling

There are 6194 schools in our sample, and we would like to use systematic sampling to select a sample of size 500. Therefore, k = 6194/500 = 13, meaning that we will select every 13th school. Now we need to randomly select the number from which to start.
To do this, we will take the integer part (obtained with the **int**
function) of a random number (obtained with the **uniform()** function).
We will multiply it by 13, because we want 13 to be the upper limit of the
numbers generated. We will add 1 to our random number, because the number
returned by the uniform() function will range from 0 to 12.9999999. If we
add 1 and take the integer part of the number (the part before the decimal
point), we will get a random number between 1 and 13. The number randomly
selected was 4. Hence, we will begin selecting into our sample every
13th school starting with school number 4. (See Levy and Lemeshow, page 83)

set seed 37 display int(uniform()*13)+14

To actually select the sample, we will sort the data by **snum** (school
number), drop the first three schools (because we want to start with school
number 4), and then generate a new variable, which we called **y**, that is
the modulus (i.e., the remainder after division) of the school number divided by
13. We drop all of the cases for which y is not equal to 0 and use the **
count** command to determine how many schools are in our sample.

use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear

sort snum drop if _n < 4 gen newsno = _n - 1 gen y = mod(newsno,13) drop if y != 0 count

477

display 6194/13

476.46154

Now we need to create the p-weights and set the fpc.

gen pw = 6194/477 gen fpc = 6194

We will recode and relabel the variable **awards**.

recode awards (1=0) (2=1) label define yesno 0 no 1 yes label values awards yesno

svyset [pweight = pw], fpc(fpc) svydes

pweight: pw Strata: <one> PSU: <observations> FPC: fpc #Obs per PSU Strata ---------------------------- <one> #PSUs #Obs min mean max -------- -------- -------- -------- -------- -------- 1 477 477 1 1.0 1 -------- -------- -------- -------- -------- -------- 1 477 477 1 1.0 1

## One-stage cluster sampling

In our one-stage cluster sample, the districts will be the cluster and the schools will be the elementary or sampling units. We have decided to use simple random sampling to select our clusters. Hence, we randomly select school districts and then select all schools within each selected district.

use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear

Next, we need to create a new data set with just one row for each **dnum**
so that we can do the sampling. We will use the **contract** command to
do this. We will get the total number of districts now for use in the
calculation of the probability weight later. Next, we will create a new variable, **
x**, with random values; sort the data on **x**; set the seed; select the
sample; and see how many districts were selected. We resort the data
on **dnum** and save the data set for use in the second part of creating the
sample.

contract dnum count

757

set seed 1002 sample 25 count

189

sort dnum keep dnum save "D:\oscs.dta", replace

Now that we know which districts have been selected to be in our sample, we
need to put that information into the full data set. (Remember that the
data set that we just created does not contain the information for each school
in the district.) We sort the file on **dnum** (notice that this is the
same variable that we sorted the other data file on), and then merge the two
files. We drop all cases that do not match and see that we have 1461 cases
selected into our sample.

use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clearsort dnum merge dnum using "D:oscs.dta" drop if _merge != 3 count

1463

Now we need to create the probability weight and fpc. Remember that the probability weight is based on the number of districts in the population and the number of districts selected into the sample, not the number cases (e.g., schools) in the sample. (See Levy and Lemeshow, page 247)

gen pw = (757/189) gen fpc = 757 svyset dnum [pweight = pw], fpc(fpc) svydes

pweight: pw Strata: <one> PSU: dnum FPC: fpc #Obs per PSU Strata ---------------------------- <one> #PSUs #Obs min mean max -------- -------- -------- -------- -------- -------- 1 189 1463 1 7.7 100 -------- -------- -------- -------- -------- -------- 1 189 1463 1 7.7 100

NOTE: There is a substantial amount of variability from one sample to the next. In some samples, the maximum number of observations per PSU is 552 and the design effects are as high as 140 for some estimates.

## Two-stage cluster sampling with stratification

To select this sample, we are going to break up the process into four parts. First we will create the strata; second, we will do the first- and second-stage sampling in strata 1; third, we will repeat the process in strata 2; fourth, we will concatenate the files for strata 1 and strata 2 to create the file working data file.

NOTE: In most cases, you will not have to create the strata yourself.
Instead, they will already be defined for you: perhaps you will use
variables such as **gender** and **race** to create your strata. We
show the creation of the strata here because occasionally (such as for teaching
purposes) you may have to do this, and there are some tricky issues involved.

NOTE2: Because we are using the same sampling procedure in each strata,
we could do the sampling for both strata at the same time, using the **by()**
option on the sample command. However, it is common not to use the same
sampling design in each strata. For example, you may over-sample
individuals of a particular group because you are interested in obtaining a more
precise estimate for that group, or because there are relatively few members in
the group. For this reason, we show the sampling individually for each
strata. Also, we tried to use code that could be easily adapted to other
situations, even if it is not the most parsimonious code possible for our
example.

Part 1: Creating the strata

We will create the strata in a manner similar to that used in the previous in the
example with stratified random sampling. We will save one file with only
the cases for strata 1 and a different file for the cases for strata 2.
However, to determine the cutoff point for which districts should be in strata 1
and which should be in strata 2, we will need to use a slightly different
procedure than the one we used before. This change is necessary because we
are now stratifying the school districts, whereas before we were stratifying
schools themselves. The find the appropriate cutpoint, we will get the
mean **api99** score for each school district, collapse the data file so that
there is only one observation per combination of district numbers and means, and
then locate the mean.

* determining the cutpoint between the 2 stratause https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear egen mean = mean(api99), by(dnum) contract dnum mean sum mean* 650histogram mean, xline(650) normal xlabel(350(50)950) freq

* creating the data file for strata 1use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear egen mean = mean(api99), by(dnum) gen strata = 1 replace strata = 2 if mean > 650 drop if strata == 2 save apipops1.dta, replace

* creating the data file for strata 2use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear egen mean = mean(api99), by(dnum) gen strata = 1 replace strata = 2 if mean > 650 drop if strata == 1 save apipops2.dta, replace

Now we will select our sample in strata 1. We will begin be determining how many cases and clusters we have. To make the process a little simpler, the output from most commands has been included as a comment immediately below the command.

* working in strata 1use apipops1.dta, clear count* 3644 casescodebook dnum* 377 clusterssort dnum by dnum: gen n = _n summ n* 1 to 552

We will select the clusters in the same way that we did for the one-stage cluster sample.

contract dnum count* 377sample 25 count* 94 school districtssort dnum save oscss1.dta, replace use apipops1.dta, clear sort dnum merge dnum using oscss1.dta drop if _merge != 3 count* 837 schoolssort dnum by dnum: gen n = _n summ n* 1 to 100

Now we will select the schools from each district. We have decided to
select three schools from each district. If a district has three or fewer
schools, we will select into the sample all of the schools. To do this, we
will create a new random variable, **xx**, by **dnum**, and sort on it.
We next create new variables that tell us the number of schools within each
district (**number**), and the total number of schools within each district (**N**).
After dropping the schools with numbers greater than 3, we sort the data on **
dnum** and **number**. Finally, we generate **nn**, which is the
total number of selected schools in each district.

by dnum: gen xx = uniform() sort dnum xx by dnum: gen number = _n by dnum: gen N = _N drop if number > 3 count* 227 schools sampledsort dnum number by dnum: gen nn = _N

Now we are ready to create the probability weights and fpc. For a two-stage
sample, the formula for the probability weights is f_{1}*f_{2}, where f_{1}
is the inverse of the sampling fraction for level 1 (selecting the clusters) and
f_{2} is the inverse of the sampling fraction for level 2 (selecting the
elements). (See Levy and Lemeshow, page 280)

gen p1 = 377/94 gen p2 = N/nn gen pwt = p1*p2 gen fpc = 377 save strata1.dta, replace

Now we are ready to select the sample in strata 2. Although we could alter the sampling plan at either level 1 or level 2, or both, we will follow the same procedure that we used in strata 1.

* working in strata 2use apipops2.dta, clear count* 2550 casescodebook dnum* 380 clusterssort dnum by dnum: gen n = _n summ n* 1 to 142

* selecting clusterscontract dnum count*380sample 25 count*95 school districtssort dnum save oscss2.dta, replace use apipops2.dta, clear sort dnum merge dnum using oscss2.dta drop if _merge != 3 count* 669 schoolssort dnum by dnum: gen n = _n summ n* 1 to 72

* selecting schools within districtsby dnum: gen xx = uniform() sort dnum xx by dnum: gen number = _n by dnum: gen N = _N drop if number > 3 count* 239 schools sampledsort dnum number by dnum: gen nn = _N* creating the pweights and fpcgen p1 = 380/95 gen p2 = N/nn gen pwt = p1*p2 gen fpc = 380 save strata2.dta, replace

At long last, we are ready to concatenate (stack) the data sets (the 2 strata) together. We will also create some variables that we will need for the analyses shown in this section.

append using strata1.dta count* 466gen comp_imp1 = comp_imp - 1 recode awards (1 = 0) ( 2= 1) gen meals3 = 2 replace meals3 = 1 if meals <= 33 replace meals3 = 3 if meals > 67 save strataboth.dta, replace svyset dnum [pweight = pwt], fpc(fpc) strata(strata) svydes

pweight: pwt Strata: strata PSU: dnum FPC: fpc #Obs per PSU Strata ---------------------------- strata #PSUs #Obs min mean max -------- -------- -------- -------- -------- -------- 1 94 227 1 2.4 3 2 95 239 1 2.5 3 -------- -------- -------- -------- -------- -------- 2 189 466 1 2.5 3