Sometimes you may be analyzing a very large data file and want to work with just a simple random sample of the data file. Other times you may want to draw a simple random sample with replacement from a small data file. Either way, SAS proc surveyselect is one way to do it, and it is fairly straightforward. Let’s use the following data set for the purpose of demonstration.
DATA hsb25;
  INPUT id gender $ race ses schtype $ prog read write math science socst;
DATALINES;
147 f 1 3 pub 1 47 62 53 53 61
108 m 1 2 pub 2 34 33 41 36 36
18 m 3 2 pub 3 50 33 49 44 36
153 m 1 2 pub 3 39 31 40 39 51
50 m 2 2 pub 2 50 59 42 53 61
51 f 2 1 pub 2 42 36 42 31 39
102 m 1 1 pub 1 52 41 51 53 56
57 f 1 2 pub 1 71 65 72 66 56
160 f 1 2 pub 1 55 65 55 50 61
136 m 1 2 pub 1 65 59 70 63 51
88 f 1 1 pub 1 68 60 64 69 66
177 m 1 2 pri 1 55 59 62 58 51
95 m 1 1 pub 1 73 60 71 61 71
144 m 1 1 pub 2 60 65 58 61 66
139 f 1 2 pub 1 68 59 61 55 71
135 f 1 3 pub 1 63 60 65 54 66
191 f 1 1 pri 1 47 52 43 48 61
171 m 1 2 pub 1 60 54 60 55 66
22 m 3 2 pub 3 42 39 39 56 46
47 f 2 3 pub 1 47 46 49 33 41
56 m 1 2 pub 3 55 45 46 58 51
128 m 1 1 pub 1 39 33 38 47 41
36 f 2 3 pub 2 44 49 44 35 51
53 m 2 2 pub 3 34 37 46 39 31
26 f 4 1 pub 1 60 59 62 61 51
;
RUN;
Random sampling without replacement
In a simple random sample without replacement, each observation in the data set has an equal chance of being selected, and once selected it cannot be chosen again. The following code creates a simple random sample of size 10 from the data set hsb25. Here the method option on the proc surveyselect statement specifies the sampling method as SRS (simple random sampling). The rep (= replicate) option specifies the number of simple random samples you want to create. The sampsize option, which is required here, specifies the size of the random sample. This number cannot exceed the number of observations in the original data set, since the sampling is done without replacement. You can also specify a seed so that exactly the same sample can be reproduced later by reusing that seed. The out option names the data set that receives the sample. The id statement specifies the variables to be included in the sample. Here we use _all_ to include all the variables in the sample.
proc surveyselect data = hsb25 method = SRS rep = 1
  sampsize = 10 seed = 12345 out = hsbs1;
  id _all_;
run;

proc print data = hsbs1 noobs;
run;

 id  gender  race  ses  schtype  prog  read  write  math  science  socst
108    m       1    2     pub      2    34     33    41      36      36
153    m       1    2     pub      3    39     31    40      39      51
 51    f       2    1     pub      2    42     36    42      31      39
 95    m       1    1     pub      1    73     60    71      61      71
139    f       1    2     pub      1    68     59    61      55      71
135    f       1    3     pub      1    63     60    65      54      66
191    f       1    1     pri      1    47     52    43      48      61
 22    m       3    2     pub      3    42     39    39      56      46
 47    f       2    3     pub      1    47     46    49      33      41
 53    m       2    2     pub      3    34     37    46      39      31
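As a quick check on the description above, the following sketch draws three replicates instead of one and confirms that no id is repeated within a replicate. It assumes the hsb25 data set created earlier. The output data set name hsbs_rep and the column aliases n_rows and n_distinct_ids are illustrative names, not part of the original example. When more than one replicate is requested, the output data set carries a Replicate variable identifying each sample.

```sas
/* Sketch: three simple random samples of size 10, without replacement. */
/* hsbs_rep is a hypothetical output name chosen for this illustration. */
proc surveyselect data = hsb25 method = srs reps = 3
  sampsize = 10 seed = 12345 out = hsbs_rep;
  id _all_;
run;

/* Within each replicate, the number of rows should equal the number   */
/* of distinct ids, because an observation cannot be drawn twice.      */
proc sql;
  select Replicate,
         count(*)           as n_rows,
         count(distinct id) as n_distinct_ids
  from hsbs_rep
  group by Replicate;
quit;
```

Because each id appears only once in hsb25, both counts should be 10 for every replicate. The same id may still appear in more than one replicate, since each replicate is drawn independently.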
Random sampling with replacement
In a random sample with replacement, each observation in the data set has an equal chance of being selected on every draw and can therefore be selected more than once. The following code creates a random sample with replacement of size 10. The option method = urs (unrestricted random sampling) is what allows the replacement. We can see from the output that the observations with id = 139 and id = 128 have been selected twice: the NumberHits column reports how many times each observation was drawn. We include only the variables id, read, write, math, science and socst in the sample data set.
proc surveyselect data=hsb25 method = urs sampsize = 10
  rep=1 seed=12345 out=hsbs2 outhits;
  id id read write math science socst;
run;

proc print data = hsbs2 noobs;
run;

Replicate   id  read  write  math  science  socst  NumberHits
    1       57   71     65    72      66      56        1
    1      136   65     59    70      63      51        1
    1      177   55     59    62      58      51        1
    1      139   68     59    61      55      71        2
    1      191   47     52    43      48      61        1
    1       56   55     45    46      58      51        1
    1      128   39     33    38      47      41        2
    1       26   60     59    62      61      51        1
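To make the role of NumberHits concrete, here is a minimal sketch that repeats the with-replacement draw without the outhits option and then totals the hits. It assumes the hsb25 data set created earlier. The output name hsbs2_hits and the aliases n_distinct_units and n_draws are illustrative only. Without outhits, the output data set holds one row per distinct selected observation, and NumberHits records how often that observation was drawn.

```sas
/* Sketch: sampling with replacement, one output row per distinct unit. */
/* hsbs2_hits is a hypothetical output name chosen for this illustration. */
proc surveyselect data = hsb25 method = urs sampsize = 10
  seed = 12345 out = hsbs2_hits;
  id id read write math science socst;
run;

/* The hits should add up to the requested sample size of 10, while    */
/* the row count can be smaller whenever an observation is drawn more  */
/* than once.                                                          */
proc sql;
  select count(*)        as n_distinct_units,
         sum(NumberHits) as n_draws
  from hsbs2_hits;
quit;
```

If you want each draw as its own row, for example to feed a bootstrap-style analysis, adding outhits back to the proc surveyselect statement is the usual choice.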