When analyzing survey data, it is common to want to look at only certain respondents, perhaps only women, or only respondents over age 50. When analyzing these subpopulations (also known as domains), you need to use the appropriate option. Stata has two subpopulation options, subpop and over, that are very flexible and easy to use. Using the subpopulation option(s) is extremely important when analyzing survey data. If the data set is subset, meaning that observations not to be included in the subpopulation are deleted from the data set, the standard errors of the estimates cannot be calculated correctly. When the subpopulation option(s) is used, only the cases defined by the subpopulation are used in the calculation of the estimate, but all cases are used in the calculation of the standard errors. For more information on this issue, please see Sampling Techniques, Third Edition by William G. Cochran (1977) and Small Area Estimation by J. N. K. Rao (2003).
For the sake of consistency, we will use the mean command for all of our examples. However, the subpop option works the same for all svy commands, and the over option works the same for every svy command that supports it.
We will start by looking at the mean of our continuous variable, ell. Next, we will consider two variables to use with the subpop option, yr_rnd, which is coded 0/1, and both, which is coded 1/2. As you will see, the subpop option handles these two variables differently.
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strsrs, clear
svy: mean ell
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2        Number of obs   =        620
Number of PSUs   =     620        Population size =       6194
                                  Design df       =        618

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         ell |   22.83578    .669696      21.52063    24.15094
--------------------------------------------------------------
Here we can see that yr_rnd is coded 0/1. (The missing option is used here to show that there are no missing values for this variable. We will want to know this later on.) Notice in the output of the svy: tab command that there are 789.6 cases coded 1. (It is not a whole number because we are estimating this value using the probability weights.) In the output of the svy: mean command, we also see that 789.552 cases are included in the subpopulation.
svy: tab yr_rnd, count nolabel missing
(running tabulate on estimation sample)

Number of strata   =         2        Number of obs     =        620
Number of PSUs     =       620        Population size   =  6193.9997
                                      Design df         =        618

----------------------
   yr_rnd |      count
----------+-----------
        0 |       5404
        1 |      789.6
          |
    Total |       6194
----------------------
  Key:  count  =  weighted counts

svy, subpop(yr_rnd): mean ell
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2        Number of obs   =        620
Number of PSUs   =     620        Population size =       6194
                                  Subpop. no. obs =         79
                                  Subpop. size    =    789.552
                                  Design df       =        618

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         ell |   43.50105   2.658549      38.28016    48.72193
--------------------------------------------------------------
Now let’s try to use a variable coded 1/2 instead of 0/1. Here we can see that both is coded 1/2. (The missing option is used here to show that there are no missing values for this variable. We will want to know this later on.) Notice in the output of the svy: tab command that there are 1888 cases coded 1. However, in the output of the svy: mean command, we see that all of the observations, 6194 cases, are included in the subpopulation. This is because the subpop option must have a true/false variable. As stated in the Stata Survey manual, when the subpop option is used, the subpopulation is actually defined by the 0s (false), which indicate those cases to be excluded from the subpopulation. Non-0 values are included in the analysis, except for missing values, which are excluded from the analysis. Because we have no cases coded as 0, all of the cases are included in the subpopulation, as explained in the note in the output.
svy: tab both, count nolabel missing
(running tabulate on estimation sample)

Number of strata   =         2        Number of obs     =        620
Number of PSUs     =       620        Population size   =  6193.9997
                                      Design df         =        618

----------------------
     both |      count
----------+-----------
        1 |       1888
        2 |       4306
          |
    Total |       6194
----------------------
  Key:  count  =  weighted counts

svy, subpop(both): mean ell
(running mean on estimation sample)

Note: subpop() subpopulation is same as full population
      subpop() = 1 indicates observation in subpopulation
      subpop() = 0 indicates observation not in subpopulation

Survey: Mean estimation

Number of strata =       2        Number of obs   =        620
Number of PSUs   =     620        Population size =       6194
                                  Subpop. no. obs =        620
                                  Subpop. size    =       6194
                                  Design df       =        618

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         ell |   22.83578    .669696      21.52063    24.15094
--------------------------------------------------------------
Now let’s create a copy of both and recode the 1s to 0s. We will also set some values to missing, to see what happens with missing values in the subpopulation variable. The output of the tab command shows us that the recoding went as planned. The output of the svy: mean command shows that all of the cases not coded 0 or missing (the 424 cases coded as 2) are included in the subpopulation, and that the 10 cases with a missing value are dropped from the analysis entirely (the number of observations falls from 620 to 610). Notice the note that Stata provides when the subpopulation variable is not coded 0/1.
gen both1 = both
recode both1 (1=0)
(both1: 189 changes made)
replace both1 = . if _n < 11
(10 real changes made, 10 to missing)
tab both1, missing

      both1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        186       30.00       30.00
          2 |        424       68.39       98.39
          . |         10        1.61      100.00
------------+-----------------------------------
      Total |        620      100.00

svy, subpop(both1): mean ell
(running mean on estimation sample)

Note: subpop() takes on values other than 0 and 1
      subpop() != 0 indicates subpopulation

Survey: Mean estimation

Number of strata =       2        Number of obs   =        610
Number of PSUs   =     610        Population size =    6094.03
                                  Subpop. no. obs =        424
                                  Subpop. size    =    4235.65
                                  Design df       =        608

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         ell |   22.03727    .894207      20.28116    23.79338
--------------------------------------------------------------
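If you would rather not rely on the "non-0 means included" rule, one option is to build an explicit 0/1 indicator first. The sketch below assumes the same seminar data set; the variable name both01 is hypothetical and is not part of the original data. Because both has no missing values, the two svy commands are expected to define the same subpopulation.

```stata
* Sketch only: both01 is a hypothetical variable name used for illustration
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strsrs, clear

* 1 if both == 2, 0 if both == 1, missing if both is missing
generate byte both01 = (both == 2) if !missing(both)

* Check the new indicator against the original variable
tabulate both both01, missing

* Subpopulation defined by the explicit 0/1 indicator
svy, subpop(both01): mean ell

* Same subpopulation defined with an if expression and no indicator variable
svy, subpop(if both == 2): mean ell
```

With a true 0/1 indicator, Stata should not need to print the note about values other than 0 and 1, which makes the output easier to check.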
You can also use if when defining your subpopulation. It should be stressed that this is VERY different from using if to remove cases from an analysis. Using if in the subpop option does not remove cases from the analysis. The cases excluded from the subpopulation by the if are still used in the calculation of the standard errors, as they should be.
svy, subpop(yr_rnd if mobility < 50): mean ell
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2        Number of obs   =        620
Number of PSUs   =     620        Population size =       6194
                                  Subpop. no. obs =         78
                                  Subpop. size    =    779.555
                                  Design df       =        618

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         ell |   43.86654   2.668957      38.62521    49.10786
--------------------------------------------------------------

svy, subpop(yr_rnd if mobility < 50 & hsg < 80): mean ell
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2        Number of obs   =        620
Number of PSUs   =     620        Population size =       6194
                                  Subpop. no. obs =         78
                                  Subpop. size    =    779.555
                                  Design df       =        618

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         ell |   43.86654   2.668957      38.62521    49.10786
--------------------------------------------------------------
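To see the difference between the two uses of if for yourself, you can run both forms side by side. This is a minimal sketch using the same data set and variables as above. The point estimates should agree, while the number of observations, the design degrees of freedom and possibly the standard error can differ, because the second command restricts the estimation sample itself.

```stata
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strsrs, clear

* Recommended: subpopulation estimation; all observations stay in the
* estimation sample and contribute to the variance calculation
svy, subpop(yr_rnd): mean ell

* Not recommended: the if qualifier on the command drops the other cases
* from the estimation sample before the variance is computed
svy: mean ell if yr_rnd == 1
```

Comparing the "Number of obs" and "Design df" lines in the two headers is a quick way to confirm which cases each command actually used.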
You can use either subpop or over with multiple variables to create the subpopulation that you want. Let’s see some examples using the over option. First, we will use yr_rnd, our 0/1 variable, then both, our 1/2 variable. Notice that the output is different from the output using the subpop option in that both categories of the variable are given, and there is no note when a 1/2 variable is used. Please note that the over option is only available for the survey commands mean, proportion, ratio and total.
svy: mean ell, over(yr_rnd)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2        Number of obs   =        620
Number of PSUs   =     620        Population size =       6194
                                  Design df       =        618

            0: yr_rnd = 0
           No: yr_rnd = No

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
ell          |
           0 |   19.81673   .6771138      18.48701    21.14646
          No |   43.50105   2.658549      38.28016    48.72193
--------------------------------------------------------------

svy: mean ell, over(both)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2        Number of obs   =        620
Number of PSUs   =     620        Population size =       6194
                                  Design df       =        618

           No: both = No
          Yes: both = Yes

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
ell          |
          No |   24.64196   1.329677      22.03073    27.25319
         Yes |   22.04363   .8854687      20.30473    23.78252
--------------------------------------------------------------
Now let’s use both yr_rnd and both as the subpopulation variables. First we will use the svy: tab command to ensure that there are cases in all four categories. Then we use the svy: mean command with the over option.
svy: tab yr_rnd both, count
(running tabulate on estimation sample)

Number of strata   =         2        Number of obs     =        620
Number of PSUs     =       620        Population size   =  6193.9997
                                      Design df         =        618

-------------------------------
          |  met both targets
   yr_rnd |    No    Yes  Total
----------+--------------------
        0 |  1659   3746   5404
       No | 229.9  559.7  789.6
          |
    Total |  1888   4306   6194
-------------------------------
  Key:  weighted counts

  Pearson:
    Uncorrected   chi2(1)         =    0.0807
    Design-based  F(1, 618)       =    0.0896     P = 0.7647

svy: mean ell, over(yr_rnd both)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2        Number of obs   =        620
Number of PSUs   =     620        Population size =       6194
                                  Design df       =        618

         Over: yr_rnd both
    _subpop_1: 0 No
    _subpop_2: 0 Yes
    _subpop_3: No No
    _subpop_4: No Yes

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
ell          |
   _subpop_1 |   21.72287   1.246971      19.27405    24.17168
   _subpop_2 |    18.9728   .8907884      17.22346    20.72213
   _subpop_3 |   45.70399   4.841131      36.19692    55.21105
   _subpop_4 |   42.59631     3.1987      36.31468    48.87795
--------------------------------------------------------------
Below we create a new variable from emer; the output shows that it has four categories. Then we will use this variable with yr_rnd and both; all combinations of the variables are shown in the output. This is often very useful and saves you from having to create a new subpopulation variable. However, if each of your variables has many categories, the output can become long and cumbersome, especially if you are only interested in a few combinations of categories.
egen emergrp = cut(emer), group(5)
svy: mean ell, over(emergrp)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2        Number of obs   =        620
Number of PSUs   =     620        Population size =       6194
                                  Design df       =        618

            1: emergrp = 1
            2: emergrp = 2
            3: emergrp = 3
            4: emergrp = 4

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
ell          |
           1 |   14.53282   .9707122      12.62652    16.43911
           2 |   19.21555   1.594388      16.08447    22.34662
           3 |   26.41136   1.815472      22.84612    29.97661
           4 |   38.60681   1.926595      34.82334    42.39028
--------------------------------------------------------------

svy: mean ell, over(emergrp yr_rnd both)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2        Number of obs   =        620
Number of PSUs   =     620        Population size =       6194
                                  Design df       =        618

         Over: emergrp yr_rnd both
    _subpop_1: 1 0 No
    _subpop_2: 1 0 Yes
    _subpop_3: 1 No No
    _subpop_4: 1 No Yes
    _subpop_5: 2 0 No
    _subpop_6: 2 0 Yes
    _subpop_7: 2 No No
    _subpop_8: 2 No Yes
    _subpop_9: 3 0 No
   _subpop_10: 3 0 Yes
   _subpop_11: 3 No No
   _subpop_12: 3 No Yes
   _subpop_13: 4 0 No
   _subpop_14: 4 0 Yes
   _subpop_15: 4 No No
   _subpop_16: 4 No Yes

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
ell          |
   _subpop_1 |   19.04537    2.13396      14.85468    23.23606
   _subpop_2 |   12.37844   1.113872      10.19101    14.56588
   _subpop_3 |   17.33333   5.380975      6.766121    27.90055
   _subpop_4 |    25.9189   6.110832      13.91839    37.91941
   _subpop_5 |   18.32239    2.38222      13.64416    23.00061
   _subpop_6 |   18.38206   2.244956      13.97339    22.79072
   _subpop_7 |   26.01227    12.7449      .9837157    51.04082
   _subpop_8 |   27.36803   5.615893      16.33949    38.39658
   _subpop_9 |   22.41529   2.814223       16.8887    27.94189
  _subpop_10 |   24.47567   2.371769      19.81797    29.13337
  _subpop_11 |   50.29112   6.681067      37.17077    63.41146
  _subpop_12 |   39.33854   7.992979      23.64185    55.03524
  _subpop_13 |   29.11945   2.818771      23.58392    34.65498
  _subpop_14 |   32.95607   2.611229      27.82811    38.08403
  _subpop_15 |   54.09091   6.870972      40.59763    67.58419
  _subpop_16 |       57.8   3.730597      50.47382    65.12618
--------------------------------------------------------------
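When only one combination of categories is of interest, a shorter alternative to the full over() listing is to describe that combination directly in subpop() with an if expression. The sketch below is an illustration under assumptions: it relies on the numeric codes shown earlier by the nolabel tabulations (yr_rnd coded 0/1, both coded 1/2) and on emergrp being created exactly as above. It is intended to correspond to the last combination in the listing (emergrp 4, yr_rnd 1, both 2).

```stata
use https://stats.idre.ucla.edu/stat/stata/seminars/svy_stata_intro/strsrs, clear

* Recreate the grouping variable used in the over() examples
egen emergrp = cut(emer), group(5)

* Estimate the mean for a single combination of categories;
* cases outside the combination still contribute to the variance calculation
svy, subpop(if emergrp == 4 & yr_rnd == 1 & both == 2): mean ell
```

It is worth checking the "Subpop. no. obs" line in the header to make sure the if expression selected the cases you intended.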
The subpop option can be combined with the over option. This is handy because if cannot be used with the over option. By combining the options, you can have “the best of both worlds.”
svy, subpop(yr_rnd if mobility < 50 & hsg < 80): mean ell, over(emergrp both)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2        Number of obs   =        620
Number of PSUs   =     620        Population size =       6194
                                  Subpop. no. obs =         78
                                  Subpop. size    =    779.555
                                  Design df       =        618

         Over: emergrp both
    _subpop_1: 1 No
    _subpop_2: 1 Yes
    _subpop_3: 2 No
    _subpop_4: 2 Yes
    _subpop_5: 3 No
    _subpop_6: 3 Yes
    _subpop_7: 4 No
    _subpop_8: 4 Yes

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
ell          |
   _subpop_1 |   17.33333   5.380975      6.766121    27.90055
   _subpop_2 |    25.9189   6.110832      13.91839    37.91941
   _subpop_3 |   26.01227    12.7449      .9837157    51.04082
   _subpop_4 |   27.36803   5.615893      16.33949    38.39658
   _subpop_5 |   50.29112   6.681067      37.17077    63.41146
   _subpop_6 |   42.38135   8.451664      25.78389    58.97882
   _subpop_7 |   54.09091   6.870972      40.59763    67.58419
   _subpop_8 |       57.8   3.730597      50.47382    65.12618
--------------------------------------------------------------