Suppose you have a data file with 25 observations. A variable called id identifies each observation, and the file also holds information about each observation; here that is just age.
DATA orig;
  INPUT id age;
CARDS;
1 3
2 32
3 13
4 16
5 4
6 9
7 43
8 29
9 43
10 47
11 13
12 6
13 43
14 48
15 34
16 13
17 47
18 6
19 34
20 42
21 47
22 49
23 28
24 25
25 39
;
RUN;
Suppose you want to make a new ID variable called newid that is unique for every observation but conceals the identity of the observation. One strategy is as follows.
1. Create a new data file with IDs in it (we will call this newids). Make more IDs than necessary because there may be duplicate IDs.
2. Eliminate any records with duplicate newid in the newids data file.
3. Scramble the order of the newids file (so the order of newid does not give away the person’s identity).
4. Merge newids with the original data file (orig), and get rid of the old id variable.
5. During the merge in step 4, make a file called crossref that shows the correspondence between id and newid.
6. Store crossref in a safe place, since that file can be used with orig2 (the anonymized file made in step 4) to determine the identity of the observations.
1. Here we make newid, which is the new random ID (5 characters, each drawn from 0-9 and A-Z), and we make ranord, which will be used for scrambling the data file.
data NEWIDS;
  length newid $ 5;            /* newid will be 5 characters wide */
  do NOBS = 1 to 40;           /* we make up 40 observations in case of duplicates */
    newid = "     ";
    do i = 1 to 5;             /* create each character of newid, 1 - 5 */
      * make random integer 0-35, to be mapped to 0-9 and A-Z ;
      rannum = int(uniform(0)*36) ;
      * if it is 0-9, convert it into 0-9, which is byte(48) - byte(57) in ASCII ;
      if (0 <= rannum <= 9) then ranch = byte(rannum + 48) ;
      * if it is 10-35, convert it into A-Z, which is byte(65) - byte(90) in ASCII ;
      if (10 <= rannum <= 35) then ranch = byte(rannum + 55) ;
      * put the character into position i of "newid" ;
      substr(newid,i,1) = ranch ;
    end;
    * make ranord ;
    ranord = uniform(0) ;
    output ;
  end;
  * just keep "newid" and "ranord" ;
  keep newid ranord ;
run;
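To get a feel for how often duplicates should occur, the step below is a rough sketch that computes the chance of at least one repeated newid. It assumes each of the 40 IDs is an independent, uniform draw of 5 characters from 36 symbols; the variable names (n_ids, n_possible, p_no_dup, p_dup) are made up for this illustration.

```sas
/* Sketch: chance that 40 random 5-character IDs contain a duplicate. */
/* Assumes independent, uniform draws from 36 symbols per position.   */
data _null_;
  n_ids      = 40;       /* IDs generated in the NEWIDS step      */
  n_possible = 36**5;    /* 36 symbols in each of 5 positions     */
  p_no_dup   = 1;
  do k = 1 to n_ids - 1;
    /* the next ID must differ from the k IDs already drawn */
    p_no_dup = p_no_dup * (1 - k / n_possible);
  end;
  p_dup = 1 - p_no_dup;
  put n_possible= p_dup=;   /* written to the SAS log */
run;
```

With 36**5 = 60,466,176 possible IDs, the chance of a duplicate among 40 draws is roughly 0.001%, so generating 40 IDs for 25 observations leaves a wide margin. With many more observations or shorter IDs, duplicates become much more likely.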
2. Get rid of any observations with a duplicated newid in newids.
PROC SORT DATA=newids NODUPKEY;
  BY newid;
RUN;
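After removing duplicates, it may be worth confirming that newids still has at least as many observations as orig. Otherwise some observations would end up without a newid. The following is a minimal sketch of such a check; n_orig and n_new are names invented for this example, and the messages go to the SAS log.

```sas
/* Sketch: confirm newids still has at least as many rows as orig. */
data _null_;
  /* NOBS= is filled in at compile time, so no observations are read */
  if 0 then set orig   nobs=n_orig;
  if 0 then set newids nobs=n_new;
  if n_new < n_orig then
    put "WARNING: only " n_new "unique newids for " n_orig "observations; generate more.";
  else
    put "NOTE: " n_new "unique newids available for " n_orig "observations.";
  stop;   /* no SET statement ever executes, so end the step explicitly */
run;
```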
3. Scramble the order of newids so the order of the observations does not give away the identity of the observations.
PROC SORT DATA=newids;
  BY ranord;
RUN;
4. Now merge orig with newids. This is a one-to-one merge (there is no BY statement), so the first observation in orig is paired with the first observation in newids, and so on. If id is missing, all of the orig observations have already been matched and this is a leftover newids observation without an orig observation, so we delete it. For orig2, drop id and ranord so the observations are now anonymous.
5. For crossref, keep id and newid so you can look up the identity if you need to. Keep crossref in a safe, secret place.
DATA orig2(DROP=id ranord) crossref(KEEP=id newid);
  MERGE orig newids;
  IF (id = .) THEN DELETE;
RUN;
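A quick sanity check on the result can catch problems before the anonymized file is shared. The query below is a sketch, with column aliases chosen for this example, that counts the observations in orig2, the number of distinct newid values, and the number of observations with a blank newid.

```sas
/* Sketch: every row of orig2 should have a distinct, nonblank newid. */
proc sql;
  select count(*)                 as n_obs,
         count(distinct newid)    as n_unique_newid,
         count(*) - count(newid)  as n_missing_newid
  from orig2;
quit;
```

If all went well, n_obs and n_unique_newid are equal and n_missing_newid is 0.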
Show the new version of the original data file, with newid. Because newid is random, the values you get will differ from those shown here.
PROC PRINT DATA=orig2(obs=10);
RUN;

OBS    AGE    NEWID
  1      3    QMB02
  2     32    1QXCR
  3     13    VO5FC
  4     16    4C63M
  5      4    2QQR8
  6      9    VT4O5
  7     43    W9IFN
  8     29    BHPJW
  9     43    B0LJQ
 10     47    QN0CC
Show the cross-reference file, with id and newid.
PROC PRINT DATA=crossref(obs=10);
RUN;

OBS    ID    NEWID
  1     1    QMB02
  2     2    1QXCR
  3     3    VO5FC
  4     4    4C63M
  5     5    2QQR8
  6     6    VT4O5
  7     7    W9IFN
  8     8    BHPJW
  9     9    B0LJQ
 10    10    QN0CC
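If you later need to recover the original id for the anonymized records, crossref can be matched back to orig2 by newid. The following is one possible sketch; the data set names orig2_sorted, crossref_sorted, and reidentified are invented for this example.

```sas
/* Sketch: look up the original id for each anonymized record. */
proc sort data=orig2 out=orig2_sorted;
  by newid;
run;

proc sort data=crossref out=crossref_sorted;
  by newid;
run;

data reidentified;
  merge orig2_sorted(in=in_data) crossref_sorted;
  by newid;
  if in_data;   /* keep only records that are present in orig2 */
run;
```

Sorting into new data sets with OUT= leaves orig2 and crossref themselves unchanged. Anyone holding both files can do this, which is why crossref needs to be stored separately from orig2.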