With double data entry, two people enter the same data independently so that the two versions can be checked against each other. The question is whether the two datasets disagree and, if so, where. We start by reading in the two datasets, one entered by person1 and the other by person2.
data person1;
  input id name $ age ht wt income;
datalines;
11 john 23 68 145 23000
12 charlie 25 72 178 45000
13 sally 21 64 135 12000
4 mike 34 70 156 5600
43 paul 30 73 189 15600
;
run;

data person2;
  input id name $ age ht wt income;
datalines;
11 john 23.5 68 145 23000
12 charles 25 52 178 45000
13 sally 21 64 . 12000
4 michael 34 70 156 5600
43 Paul 30 73 189 5600
;
run;
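Next, we sort both datasets by the identifier variable, id, and then use the compare procedure to see whether any discrepancies exist between them. The novalues option suppresses the listing of the individual differing values, so only the summary reports are printed.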
proc sort data = person1;
  by id;
run;

proc sort data = person2;
  by id;
run;

proc compare base = person1 compare = person2 novalues;
run;

The COMPARE Procedure
Comparison of WORK.PERSON1 with WORK.PERSON2
(Method=EXACT)

Data Set Summary

Dataset          Created            Modified           NVar  NObs
WORK.PERSON1     18JAN06:09:01:28   18JAN06:09:01:28      6     5
WORK.PERSON2     18JAN06:09:01:28   18JAN06:09:01:28      6     5

Variables Summary

Number of Variables in Common: 6.

Observation Summary

Observation      Base  Compare
First Obs           1        1
First Unequal       1        1
Last Unequal        5        5
Last Obs            5        5

Number of Observations in Common: 5.
Total Number of Observations Read from WORK.PERSON1: 5.
Total Number of Observations Read from WORK.PERSON2: 5.
Number of Observations with Some Compared Variables Unequal: 5.
Number of Observations with All Compared Variables Equal: 0.

Values Comparison Summary

Number of Variables Compared with All Observations Equal: 1.
Number of Variables Compared with Some Observations Unequal: 5.
Number of Variables with Missing Value Differences: 1.
Total Number of Values which Compare Unequal: 7.
Maximum Difference: 10000.

Variables with Unequal Values

Variable  Type  Len  Ndif   MaxDif  MissDif
name      CHAR    8     3                 0
age       NUM     8     1    0.500        0
ht        NUM     8     1   20.000        0
wt        NUM     8     1        0        1
income    NUM     8     1    10000        0
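When the comparison is part of a larger program, it can be handy to test for discrepancies without reading the listing. The sketch below relies on the automatic macro variable SYSINFO, which PROC COMPARE sets to a return code (0 when no differences are found, with the value 4096 bit indicating that at least one value comparison was unequal). The macro variable name rc is a hypothetical choice for illustration, and the messages are only examples.

```sas
/* Re-run the comparison without printed output. */
proc compare base = person1 compare = person2 noprint;
run;

/* SYSINFO is overwritten by later steps, so copy it right away.
   rc is a hypothetical macro variable name. */
%let rc = &sysinfo;

data _null_;
  rc = &rc;
  if rc = 0 then put 'The two datasets are identical.';
  /* 4096 is the bit PROC COMPARE uses for unequal values. */
  else if band(rc, 4096) then put 'At least one value differs, return code: ' rc;
  else put 'The datasets differ in something other than values, return code: ' rc;
run;
```

With the two datasets above, the second branch is the one that should fire, since the summary reports unequal values.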
The basic compare procedure revealed that differences do exist. We now want to locate the discrepancies by id. The by statement reports the discrepancies observation by observation; without it, they would be grouped by variable. The id statement matches observations on id and labels each row of the report with it, and the brief option limits the output to a short summary of the differences. Reporting by observation makes it convenient to correct the errors on a case-by-case basis.
proc compare base = person1 compare = person2 brief;
  by id;
  id id;
run;

The COMPARE Procedure
Comparison of WORK.PERSON1 with WORK.PERSON2
(Method=EXACT)

id=4

NOTE: Values of the following 1 variables compare unequal: name

Value Comparison Results for Variables
_________________________________________________________
         ||  Base Value   Compare Value
      id ||  name         name
 _______ ||  ________     ________
         ||
       4 ||  mike         michael
_________________________________________________________

id=11

NOTE: Values of the following 1 variables compare unequal: age

Value Comparison Results for Variables
_________________________________________________________
         ||       Base    Compare
      id ||        age        age      Diff.     % Diff
 _______ ||  _________  _________  _________  _________
         ||
      11 ||    23.0000    23.5000     0.5000     2.1739
_________________________________________________________

id=12

NOTE: Values of the following 2 variables compare unequal: name ht

Value Comparison Results for Variables
_________________________________________________________
         ||  Base Value   Compare Value
      id ||  name         name
 _______ ||  ________     ________
         ||
      12 ||  charlie      charles
_________________________________________________________
_________________________________________________________
         ||       Base    Compare
      id ||         ht         ht      Diff.     % Diff
 _______ ||  _________  _________  _________  _________
         ||
      12 ||    72.0000    52.0000   -20.0000   -27.7778
_________________________________________________________

id=13

NOTE: Values of the following 1 variables compare unequal: wt

Value Comparison Results for Variables
_________________________________________________________
         ||       Base    Compare
      id ||         wt         wt      Diff.     % Diff
 _______ ||  _________  _________  _________  _________
         ||
      13 ||   135.0000          .          .          .
_________________________________________________________

id=43

NOTE: Values of the following 2 variables compare unequal: name income

Value Comparison Results for Variables
_________________________________________________________
         ||  Base Value   Compare Value
      id ||  name         name
 _______ ||  ________     ________
         ||
      43 ||  paul         Paul
_________________________________________________________
_________________________________________________________
         ||       Base    Compare
      id ||     income     income      Diff.     % Diff
 _______ ||  _________  _________  _________  _________
         ||
      43 ||      15600       5600     -10000   -64.1026
_________________________________________________________
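If the discrepancies need to be worked through in a later step rather than read off the listing, one option is to have PROC COMPARE write them to a dataset. The following is a minimal sketch, assuming the two datasets are still sorted by id; the output dataset name diffs is a hypothetical choice.

```sas
proc compare base = person1 compare = person2
             out = diffs       /* hypothetical output dataset name */
             outnoequal        /* keep only observations with at least one difference */
             outbase outcomp   /* write the base and the comparison values */
             outdif            /* write a row of differences as well */
             noprint;
  id id;                       /* match observations on id */
run;

/* _TYPE_ marks each row as BASE, COMPARE, or DIF. */
proc print data = diffs;
run;
```

In the DIF rows, numeric variables hold the comparison value minus the base value, while character variables are shown as a pattern of periods (matching characters) and X's (differing characters).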
Note from the last case, id = 43, that the procedure is case sensitive for character variables: paul and Paul are reported as unequal.
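If differences in case should not count as discrepancies, one possible workaround is to convert the character variable to a common case in copies of both datasets before comparing. This is only a sketch: the dataset names person1_uc and person2_uc are hypothetical, and it assumes person1 and person2 are already sorted by id.

```sas
/* Upper-case name in copies so the original entries stay untouched. */
data person1_uc;               /* hypothetical dataset name */
  set person1;
  name = upcase(name);
run;

data person2_uc;               /* hypothetical dataset name */
  set person2;
  name = upcase(name);
run;

proc compare base = person1_uc compare = person2_uc brief;
  id id;
run;
```

With this change, id = 43 should be flagged only for income, since PAUL and PAUL now compare equal. The names for id = 4 and id = 12 still differ in spelling and would continue to be reported.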