When two people enter the same data (double data entry), the question is whether the two resulting datasets disagree anywhere and, if so, where; catching such discrepancies is the rationale for double data entry. We start by reading in the two datasets, one entered by person1 and the other by person2.
data person1;
  input id name $ age ht wt income;
datalines;
11 john 23 68 145 23000
12 charlie 25 72 178 45000
13 sally 21 64 135 12000
4 mike 34 70 156 5600
43 paul 30 73 189 15600
;
run;

data person2;
  input id name $ age ht wt income;
datalines;
11 john 23.5 68 145 23000
12 charles 25 52 178 45000
13 sally 21 64 . 12000
4 michael 34 70 156 5600
43 Paul 30 73 189 5600
;
run;
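Next, we sort both datasets by the identifier variable, id, and then run the compare procedure to see whether any discrepancies exist between them. The novalues option suppresses the listing of the individual differing values, so only the summary reports are printed.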
proc sort data = person1;
  by id;
run;

proc sort data = person2;
  by id;
run;

proc compare base = person1 compare = person2 novalues;
run;

The COMPARE Procedure
Comparison of WORK.PERSON1 with WORK.PERSON2
(Method=EXACT)

Data Set Summary

Dataset          Created             Modified            NVar  NObs
WORK.PERSON1     18JAN06:09:01:28    18JAN06:09:01:28       6     5
WORK.PERSON2     18JAN06:09:01:28    18JAN06:09:01:28       6     5

Variables Summary

Number of Variables in Common: 6.

Observation Summary

Observation      Base  Compare
First Obs           1        1
First Unequal       1        1
Last Unequal        5        5
Last Obs            5        5

Number of Observations in Common: 5.
Total Number of Observations Read from WORK.PERSON1: 5.
Total Number of Observations Read from WORK.PERSON2: 5.
Number of Observations with Some Compared Variables Unequal: 5.
Number of Observations with All Compared Variables Equal: 0.

Values Comparison Summary

Number of Variables Compared with All Observations Equal: 1.
Number of Variables Compared with Some Observations Unequal: 5.
Number of Variables with Missing Value Differences: 1.
Total Number of Values which Compare Unequal: 7.
Maximum Difference: 10000.

Variables with Unequal Values

Variable  Type  Len  Ndif   MaxDif  MissDif
name      CHAR    8     3                 0
age       NUM     8     1    0.500        0
ht        NUM     8     1   20.000        0
wt        NUM     8     1        0        1
income    NUM     8     1    10000        0
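The basic run of the compare procedure shows that differences exist: all five observations disagree on at least one variable, and only id matches throughout. We now want to locate the discrepancies for each id. The by statement reports the discrepancies one observation (one id) at a time; without it, they would be grouped by variable instead. The id statement matches observations on id and labels each row of the output with it, and the brief option restricts the report to a short summary of the differences. Organizing the output this way makes it convenient to correct the errors on a case-by-case basis.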
proc compare base = person1 compare = person2 brief;
by id;
id id;
run;
The COMPARE Procedure
Comparison of WORK.PERSON1 with WORK.PERSON2
(Method=EXACT)
id=4
NOTE: Values of the following 1 variables compare unequal: name
Value Comparison Results for Variables
_________________________________________________________
|| Base Value Compare Value
id || name name
_______ || ________ ________
||
4 || mike michael
_________________________________________________________
id=11
NOTE: Values of the following 1 variables compare unequal: age
Value Comparison Results for Variables
_________________________________________________________
|| Base Compare
id || age age Diff. % Diff
_______ || _________ _________ _________ _________
||
11 || 23.0000 23.5000 0.5000 2.1739
_________________________________________________________
id=12
NOTE: Values of the following 2 variables compare unequal: name ht
Value Comparison Results for Variables
_________________________________________________________
|| Base Value Compare Value
id || name name
_______ || ________ ________
||
12 || charlie charles
_________________________________________________________
_________________________________________________________
|| Base Compare
id || ht ht Diff. % Diff
_______ || _________ _________ _________ _________
||
12 || 72.0000 52.0000 -20.0000 -27.7778
_________________________________________________________
id=13
NOTE: Values of the following 1 variables compare unequal: wt
Value Comparison Results for Variables
_________________________________________________________
|| Base Compare
id || wt wt Diff. % Diff
_______ || _________ _________ _________ _________
||
13 || 135.0000 . . .
_________________________________________________________
id=43
NOTE: Values of the following 2 variables compare unequal: name income
Value Comparison Results for Variables
_________________________________________________________
|| Base Value Compare Value
id || name name
_______ || ________ ________
||
43 || paul Paul
_________________________________________________________
_________________________________________________________
|| Base Compare
id || income income Diff. % Diff
_______ || _________ _________ _________ _________
||
43 || 15600 5600 -10000 -64.1026
_________________________________________________________
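The listing above is meant to be read by a person. If the discrepancies also need to be processed further, one possibility is to have the compare procedure write them to a dataset. The following is a minimal sketch under that assumption, not part of the original analysis; the dataset name diffs is a hypothetical choice, and both input datasets are assumed to be sorted by id as above.

```sas
/* Sketch: save only the observations that differ to a dataset.       */
/* "diffs" is a hypothetical name chosen for this illustration.       */
proc compare base = person1 compare = person2
             out = diffs      /* write comparison results to a dataset */
             outnoequal       /* skip observations that match fully    */
             outbase          /* include the value from person1        */
             outcomp          /* include the value from person2        */
             outdif           /* include a row describing differences  */
             noprint;         /* suppress the printed report           */
  id id;                      /* match observations on id              */
run;

proc print data = diffs;
run;
```

Each id with a discrepancy should then appear in diffs with one row per requested type, distinguished by the automatic _TYPE_ variable (BASE, COMPARE and DIF). For character variables such as name, the DIF row marks differing character positions rather than giving a numeric difference.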
Note from the output for id = 43, where paul and Paul are reported as unequal, that the procedure is case sensitive when comparing character variables.
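If differences in capitalization should not count as data entry errors, one way to ignore them is to convert the character variable to a common case before comparing. The sketch below assumes this is acceptable for the name variable; the dataset names person1_lc and person2_lc are hypothetical, and the copies keep the sort order of the sorted originals.

```sas
/* Sketch: compare names without regard to case.                      */
/* person1_lc and person2_lc are hypothetical names for the copies.   */
data person1_lc;
  set person1;
  name = lowcase(name);   /* convert name to lowercase */
run;

data person2_lc;
  set person2;
  name = lowcase(name);   /* convert name to lowercase */
run;

proc compare base = person1_lc compare = person2_lc brief;
  by id;
  id id;
run;
```

With the names lowercased, id = 43 should be flagged only for income, while genuine spelling differences such as mike versus michael are still reported.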
