Many medical researchers want to put their research data on laptops or USB drives. However, to protect patient privacy, they shouldn’t carry around files that contain both patient identifiers and patient data. What they need is one data file that pairs fake IDs with the patient data, and a second file that pairs the patient identifiers with the fake IDs in case they ever need to figure out which fake ID belongs to which patient. This page shows one way you can do this.
The master data file
The master data file should be kept on a secure computer in a secure location. It should not be kept on a laptop that leaves the secure location or on a USB drive.
Here is an example dataset with 10 observations on four patients. Note that the data contain both a patient ID and name along with the patient data.
use https://stats.idre.ucla.edu/stat/data/patient_data, clear
list

     +-----------------------------------------------------------+
     |      pid            name          x          y          z |
     |-----------------------------------------------------------|
  1. | 312-4737   Jackson, Mary   63.35427   125.6506   11.00734 |
  2. | 247-3114     Jones, Alan   51.97711   88.53608   2.662157 |
  3. | 153-8133   Brown, Albert   55.29694    114.453   10.82074 |
  4. | 247-3114     Jones, Alan   66.30132   102.7997    10.5982 |
  5. | 149-1324     Smith, John   57.32788   94.00582   6.630976 |
     |-----------------------------------------------------------|
  6. | 153-8133   Brown, Albert   51.11831    82.4387   9.542996 |
  7. | 153-8133   Brown, Albert   57.07088   84.79552   7.558742 |
  8. | 149-1324     Smith, John   79.56186   99.08196   11.91376 |
  9. | 247-3114     Jones, Alan   50.43491    93.8356   12.83728 |
 10. | 153-8133   Brown, Albert    66.0938   81.51173   12.28129 |
     +-----------------------------------------------------------+
Next, we will generate a fake ID using the encode command. Because encode stores the original pid strings as value labels on the new variable, we will then strip those identifying value labels off the fake ID.
encode pid, gen(fid)
label drop fid    // strip off value labels
list

     +-----------------------------------------------------------------+
     |      pid            name          x          y          z   fid |
     |-----------------------------------------------------------------|
  1. | 312-4737   Jackson, Mary   63.35427   125.6506   11.00734     4 |
  2. | 247-3114     Jones, Alan   51.97711   88.53608   2.662157     3 |
  3. | 153-8133   Brown, Albert   55.29694    114.453   10.82074     2 |
  4. | 247-3114     Jones, Alan   66.30132   102.7997    10.5982     3 |
  5. | 149-1324     Smith, John   57.32788   94.00582   6.630976     1 |
     |-----------------------------------------------------------------|
  6. | 153-8133   Brown, Albert   51.11831    82.4387   9.542996     2 |
  7. | 153-8133   Brown, Albert   57.07088   84.79552   7.558742     2 |
  8. | 149-1324     Smith, John   79.56186   99.08196   11.91376     1 |
  9. | 247-3114     Jones, Alan   50.43491    93.8356   12.83728     3 |
 10. | 153-8133   Brown, Albert    66.0938   81.51173   12.28129     2 |
     +-----------------------------------------------------------------+
We are going to save the dataset under a new name, patient_data2, for the first time. In a little bit we will remove the identifying information and save it for a second time. In the meantime, we will create a file that contains both the old and new (fake) IDs using collapse.
save patient_data2, replace    // save first time

* create file with codes for each patient
collapse (first) pid name, by(fid)
list

     +--------------------------------+
     | fid        pid            name |
     |--------------------------------|
  1. |   1   149-1324     Smith, John |
  2. |   2   153-8133   Brown, Albert |
  3. |   3   247-3114     Jones, Alan |
  4. |   4   312-4737   Jackson, Mary |
     +--------------------------------+

* save file as patient_code
save patient_code, replace
The file, patient_code, can be used to go from the new fake IDs to the original patient IDs. This file should also be kept on a secure computer in a secure location.
The working data file
Finally, we will read patient_data2 back in and strip out all of the identifying information before resaving.
use patient_data2, clear
drop pid name    // drop identifying information
save patient_data2, replace
list

     +--------------------------------------+
     |        x          y          z   fid |
     |--------------------------------------|
  1. | 63.35427   125.6506   11.00734     4 |
  2. | 51.97711   88.53608   2.662157     3 |
  3. | 55.29694    114.453   10.82074     2 |
  4. | 66.30132   102.7997    10.5982     3 |
  5. | 57.32788   94.00582   6.630976     1 |
     |--------------------------------------|
  6. | 51.11831    82.4387   9.542996     2 |
  7. | 57.07088   84.79552   7.558742     2 |
  8. | 79.56186   99.08196   11.91376     1 |
  9. | 50.43491    93.8356   12.83728     3 |
 10. |  66.0938   81.51173   12.28129     2 |
     +--------------------------------------+
The file, patient_data2, is now our working data file and is safe for use on laptops and USB drives.
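If you ever need to figure out which fake ID is which patient, the two files can be joined again on fid. The following is a minimal sketch, assuming both patient_data2 and patient_code are in the current working directory on the secure computer. The assert line is an added safety check, not part of the steps above.

```stata
* run this only on the secure computer, where patient_code is stored
use patient_data2, clear

* each fid occurs many times in the data but only once in patient_code
merge m:1 fid using patient_code

* every observation should match a patient; stop with an error if not
assert _merge == 3
drop _merge

list fid pid name x y z
```

The merged data again contain identifiers, so they should not be saved over patient_data2 or copied off the secure computer.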
Wait, what if pid is numeric? No problem, just use tostring and then follow the above steps.
tostring pid, replace
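As an alternative sketch, not the method used above, egen's group() function may also work here. It numbers the distinct values of pid as 1, 2, ... whether pid is numeric or string, and it attaches no value labels unless you ask for them, so tostring and label drop would not be needed.

```stata
* alternative to tostring + encode + label drop (assumes pid identifies each patient)
egen fid = group(pid)
```

The remaining steps (save, collapse, drop pid name, save again) would then be the same as above.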