There are several ways that you can analyze a temporary or a permanent subset of your data. The examples below will illustrate some of these methods.
Example 1: Creating a random sample
There may be times when you would like to analyze only a subset of your data. For example, suppose that you have a huge data file with thousands of cases, and that you have written a syntax file to analyze the data. Because the syntax may take hours to run, you may want to take a relatively small sample of your data and run the syntax on that to see if it works properly. There are several ways that you could create a sub-sample, such as using only the first 100 cases. However, in this situation, it may be best to take a random sample of your data. The SPSS command to do this is sample. For this example, we will randomly select 20% of the data, and we will use the means command to show the effect of taking the subset.
Let’s consider the following data set. It has two independent variables (iv1 and iv2) and two dependent variables (dv1 and dv2).
data list list / sub iv1 iv2 dv1 dv2.
begin data
1 1 1 . 25
2 1 1 49 37
3 1 1 50 55
4 2 1 . 19
5 2 1 20 38
6 2 0 23 48
7 2 0 28 44
8 3 0 28 68
9 3 0 . 30
10 3 0 32 36
end data.
Note that the SPSS output contains a series of warning messages. The data were read in correctly; SPSS is simply letting the user know that the missing values designated by a "." in the raw data above were read in as system-missing.
save outfile 'c:sset.sav'.
means dv2 by iv1.

Case Processing Summary

             Cases
             Included       Excluded       Total
             N   Percent    N   Percent    N   Percent
DV2 * IV1    10  100.0%     0   .0%        10  100.0%

Report

DV2
IV1      Mean       N    Std. Deviation
1.00     39.0000    3    15.09967
2.00     37.2500    4    12.84199
3.00     44.6667    3    20.42874
Total    40.0000    10   14.46836

sample .20.
means dv2 by iv1.
Case Processing Summary

             Cases
             Included       Excluded       Total
             N   Percent    N   Percent    N   Percent
DV2 * IV1    4   100.0%     0   .0%        4   100.0%

Report

DV2
IV1      Mean       N    Std. Deviation
1.00     25.0000    1    .
2.00     38.0000    1    .
3.00     33.0000    2    4.24264
Total    32.2500    4    5.90903
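Be aware that sample takes a permanent sample of the data in the working file. In other words, the cases that are not selected are deleted. Note also that when a proportion is given, sample keeps only approximately that share of the cases, which is why 4 of the 10 cases were retained here rather than exactly 2. The next example will illustrate how to take a subset without deleting the non-selected cases.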
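Because sample draws cases at random, running the same syntax twice will generally select different cases. If you want to be able to reproduce a particular sample, one option is to set the random number seed before sampling. The sketch below assumes that the default random number generator is in use and that the data were saved as in the example above; the seed value 12345 is an arbitrary choice.

```spss
get file 'c:sset.sav'.
* The seed value is arbitrary and is used here only for illustration.
set seed = 12345.
sample .20.
means dv2 by iv1.
```

Rerunning these commands with the same seed should select the same cases each time.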
Example 2: Creating a temporary random sample
The temporary command can be used with most SPSS commands, and we will use it here to create a temporary subset of the data in the working file. The temporary command allows you to create or transform variables and is in effect only until the next procedure is executed. In the example below, the means command is the procedure that will terminate the temporary command. To illustrate this, we will issue the means command twice. The first time, the temporary command will be in effect, and the descriptive statistics will reflect the reduced number of cases. It will also terminate the temporary command, so that the second means command will be run on the full data set.
get file 'c:sset.sav'.
temporary.
sample .20.
means dv2 by iv1.

Case Processing Summary

             Cases
             Included       Excluded       Total
             N   Percent    N   Percent    N   Percent
DV2 * IV1    4   100.0%     0   .0%        4   100.0%

Report

DV2
IV1      Mean       N    Std. Deviation
1.00     39.0000    3    15.09967
2.00     19.0000    1    .
Total    34.0000    4    15.87451

means dv2 by iv1.

Case Processing Summary

             Cases
             Included       Excluded       Total
             N   Percent    N   Percent    N   Percent
DV2 * IV1    10  100.0%     0   .0%        10  100.0%

Report

DV2
IV1      Mean       N    Std. Deviation
1.00     39.0000    3    15.09967
2.00     37.2500    4    12.84199
3.00     44.6667    3    20.42874
Total    40.0000    10   14.46836
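The same logic applies to variables created after the temporary command. As a minimal sketch, the syntax below computes a centered version of dv2 that exists only for the means command that follows it. The variable name dv2c is hypothetical, and the constant 40 (the overall mean of dv2 in the full data set) is used purely for illustration.

```spss
temporary.
compute dv2c = dv2 - 40.
means dv2c by iv1.
* dv2c was temporary, so it is no longer available after the means command above.
```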
Example 3: Selecting a specific number of cases
The sample command can also be used to select a specific number of cases. For example, suppose that you wanted to obtain descriptive statistics on four cases randomly drawn from the first eight cases in the data set.
temporary.
sample 4 from 8.
means dv2 by iv1.

Case Processing Summary

             Cases
             Included       Excluded       Total
             N   Percent    N   Percent    N   Percent
DV2 * IV1    4   100.0%     0   .0%        4   100.0%

Report

DV2
IV1      Mean       N    Std. Deviation
1.00     25.0000    1    .
2.00     43.3333    3    5.03322
Total    38.7500    4    10.04573
If you wanted the four cases to be drawn from the entire data set, you would simply put the total number of cases after the keyword from.
temporary.
sample 4 from 10.
means dv2 by iv1.

Case Processing Summary

             Cases
             Included       Excluded       Total
             N   Percent    N   Percent    N   Percent
DV2 * IV1    4   100.0%     0   .0%        4   100.0%

Report

DV2
IV1      Mean       N    Std. Deviation
1.00     37.0000    1    .
2.00     38.0000    1    .
3.00     33.0000    2    4.24264
Total    35.2500    4    3.59398
Example 4: Selecting a specific number of the first cases
Suppose that you read a large data file into SPSS and you just wanted to see if the data were read in properly. Because the file is large, running descriptive statistics on the entire data set would be time-consuming, and would probably not be any more informative than running the descriptive statistics on a small subset. You could use the n of cases command to select, say, the first 50 cases in the data file. Like the sample command, the n of cases command permanently modifies your data set. If you do not want the rest of the cases to be deleted, you will need to use the temporary command just before the n of cases command. Also note that the n of cases command can be shortened to n. The examples below select the first five cases of our small data set.
temporary.
n of cases 5.
means dv2 by iv1.

Case Processing Summary

             Cases
             Included       Excluded       Total
             N   Percent    N   Percent    N   Percent
DV2 * IV1    5   100.0%     0   .0%        5   100.0%

Report

DV2
IV1      Mean       N    Std. Deviation
1.00     39.0000    3    15.09967
2.00     28.5000    2    13.43503
Total    34.8000    5    13.86362
Equivalently,
temporary.
n 5.
means dv2 by iv1.

Case Processing Summary

             Cases
             Included       Excluded       Total
             N   Percent    N   Percent    N   Percent
DV2 * IV1    5   100.0%     0   .0%        5   100.0%

Report

DV2
IV1      Mean       N    Std. Deviation
1.00     39.0000    3    15.09967
2.00     28.5000    2    13.43503
Total    34.8000    5    13.86362
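If your only goal is to look at the first few cases rather than to analyze them, listing them may be enough, and it leaves the working file untouched. A minimal sketch using the list command with the variables from the data set above:

```spss
* Display the first five cases without removing any cases from the working file.
list variables = sub iv1 iv2 dv1 dv2
  /cases = from 1 to 5.
```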
Example 5: Selecting cases based on value of one or more variables
Sometimes you may want to select cases based on the value of one or more variables. For example, suppose that you wanted to obtain descriptive statistics for only those cases where iv1 was greater than one and dv2 was less than 40. The select if command will permanently select that subset of cases from your data set. As with the other commands, you can use the temporary command to temporarily select the desired cases.
temporary.
select if (iv1 gt 1 and dv2 lt 40).
means dv2 by iv2.

Case Processing Summary

             Cases
             Included       Excluded       Total
             N   Percent    N   Percent    N   Percent
DV2 * IV2    4   100.0%     0   .0%        4   100.0%

Report

DV2
IV2      Mean       N    Std. Deviation
.00      33.0000    2    4.24264
1.00     28.5000    2    13.43503
Total    30.7500    4    8.53913
You can also use the select if command to select cases that have a missing value for the variable of interest. For example, suppose that you wanted to select and analyze the cases for which dv1 had a missing value.
temporary.
select if (sysmis(dv1)).
means dv2 by iv1.

Case Processing Summary

             Cases
             Included       Excluded       Total
             N   Percent    N   Percent    N   Percent
DV2 * IV1    3   100.0%     0   .0%        3   100.0%

Report

DV2
IV1      Mean       N    Std. Deviation
1.00     25.0000    1    .
2.00     19.0000    1    .
3.00     30.0000    1    .
Total    24.6667    3    5.50757
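The opposite selection, keeping only the cases for which dv1 is not missing, can be written by negating the same function. This is a minimal sketch using the variables from the data set above; it has not been run against the example data, so no output is shown.

```spss
temporary.
* Keep only the cases where dv1 has a valid (non-missing) value.
select if (not sysmis(dv1)).
means dv2 by iv1.
```

Note that sysmis only detects system-missing values. If your data also use user-defined missing values, the missing function is likely the better choice, since it is true for both kinds.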
Example 6: Filtering by a variable
Another way to subset your data is to filter them by a variable. The variable that is to be used as a filter must be a numeric variable that is coded zero/one (i.e., a dummy variable). The cases coded as zero will be filtered out, that is, excluded from the analysis. If the filter variable is dichotomous but coded, say, one/two, no case has a value of zero, so SPSS will run the requested command on all of the cases, and it will issue neither an error message nor a warning message. You can tell if the filter is on by looking in the lower right-hand corner of the data editor for the "filter on" message. You can see which cases are being filtered out by looking at the left-most column of the data editor: cases with a slash through the number are being filtered out. The filter command does not make permanent changes to your data set, and you can turn it off by issuing the filter off command. Let's suppose that you wanted to use iv2 as a filter.
filter by iv2.
means dv2 by iv1.

Case Processing Summary

             Cases
             Included       Excluded       Total
             N   Percent    N   Percent    N   Percent
DV2 * IV1    5   100.0%     0   .0%        5   100.0%

Report

DV2
IV1      Mean       N    Std. Deviation
1.00     39.0000    3    15.09967
2.00     28.5000    2    13.43503
Total    34.8000    5    13.86362

filter off.
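If the condition you want to filter on is not already stored as a zero/one variable, you can compute one first. The sketch below builds a filter variable from the condition used in Example 5; the variable name filt is hypothetical and was chosen only for this illustration.

```spss
* filt is a hypothetical name and equals 1 when the condition is true and 0 when it is false.
compute filt = (iv1 gt 1 and dv2 lt 40).
filter by filt.
means dv2 by iv2.
filter off.
```

Unlike select if, this approach keeps every case in the working file, so the full data set is available again as soon as the filter is turned off.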
Example 7: Subsetting to match percentage in sample to percentage in population
Suppose that you conducted a survey of 10000 people and 70% of your respondents were female. You know that females make up only about 52% of the population, so you would like to take a subset of your female respondents such that the proportion of females to males in your data is more similar to that found in the population. First, you need to calculate how many female respondents you want to keep in your data set. Next, you would put the sample command inside a do if ... end if structure, so that only the female respondents are sampled and all of the male respondents are kept. Finally, you would save the file with a new name, so that your original data would be preserved.
do if gender = 'female'.
sample 3250 from 7000.
end if.
save outfile 'c:subset.sav'.
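The figure of 3250 in the sample command follows from the proportions given in the text. In the derivation below, f is a symbol introduced only for this sketch; it stands for the number of female respondents to keep, assuming that all of the male respondents are retained.

```latex
\begin{align*}
\text{males} &= 0.30 \times 10000 = 3000 \\
\text{females} &= 0.70 \times 10000 = 7000 \\
\frac{f}{f + 3000} &= 0.52 \\
f &= \frac{0.52 \times 3000}{0.48} = 3250
\end{align*}
```

Keeping 3250 of the 7000 female respondents gives a file of 6250 cases, of which 3250 / 6250 = 52% are female.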