Centering a variable means subtracting a constant from every value of that variable. There are several ways to center variables. For example, you could center a variable around a constant that has intrinsic meaning for it, such as centering a continuous variable age around 18, the age at which Americans become eligible to vote. You could also center a variable around its mean, or you could use a categorical variable to group your continuous variable and center each value around the mean of its group. Each of these techniques is shown below.
We will use the test data set presented below for all of our examples. We understand that for most purposes such a data set is unrealistically small, but its size makes it easier to see what is happening in each step.
data test;
  input studentid class score1 score2;
cards;
1 1 34 24
2 1 39 25
3 1 34 26
4 1 38 20
5 1 32 21
1 2 45 36
2 2 43 30
3 2 48 39
4 2 41 37
5 2 40 31
1 3 50 46
2 3 51 49
3 3 57 48
4 3 50 40
5 3 57 46
;
run;
1. Centering a variable around a constant
Suppose that we wanted to center all of the values in the variable score1 around 45.
data center45;
  set test;
  c45 = score1 - 45;
run;

proc print data = center45;
run;

Obs  studentid  class  score1  score2  c45
  1          1      1      34      24  -11
  2          1      2      45      36    0
  3          1      3      50      46    5
  4          2      1      39      25   -6
  5          2      2      43      30   -2
  6          2      3      51      49    6
  7          3      1      34      26  -11
  8          3      2      48      39    3
  9          3      3      57      48   12
 10          4      1      38      20   -7
 11          4      2      41      37   -4
 12          4      3      50      40    5
 13          5      1      32      21  -13
 14          5      2      40      31   -5
 15          5      3      57      46   12
Now let’s center the scores for each class around a different constant. Suppose that score1 should be centered around 30 for class 1, around 40 for class 2, and around 50 for class 3. The proc sort was added only to make the output easier to read; it is not necessary for the program to work.
data centerdiff;
  set test;
  if class = 1 then c1 = score1 - 30;
  if class = 2 then c1 = score1 - 40;
  if class = 3 then c1 = score1 - 50;
run;

proc sort data = centerdiff;
  by class studentid;
run;

proc print data = centerdiff;
run;

Obs  studentid  class  score1  score2  c1
  1          1      1      34      24   4
  2          2      1      39      25   9
  3          3      1      34      26   4
  4          4      1      38      20   8
  5          5      1      32      21   2
  6          1      2      45      36   5
  7          2      2      43      30   3
  8          3      2      48      39   8
  9          4      2      41      37   1
 10          5      2      40      31   0
 11          1      3      50      46   0
 12          2      3      51      49   1
 13          3      3      57      48   7
 14          4      3      50      40   0
 15          5      3      57      46   7
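When there are more than a few groups, a separate if statement per group becomes tedious. The following is a minimal sketch of one alternative, assuming the class codes are the consecutive integers 1, 2 and 3 as in the test data: the constants are stored in a temporary array and looked up by class. The data set name centerlookup and the array name target are made up for this illustration.

```sas
* Sketch: look up the centering constant by class instead of using one if statement per class;
data centerlookup;
  set test;
  * Element 1 holds the constant for class 1, element 2 for class 2, and so on;
  array target{3} _temporary_ (30 40 50);
  * Guard against class values outside 1 to 3, which would leave c1 missing;
  if 1 <= class <= 3 then c1 = score1 - target{class};
run;

proc print data = centerlookup;
run;
```

This should give the same c1 values as the data step above, although the rows print in the current order of test because no proc sort is applied here.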
2. Grand mean centering
Instead of centering a variable around a value that you select, you may want to center it around its mean. This is known as grand mean centering. There are at least three ways that you can do this. Perhaps the most straightforward way is to get the mean of each variable that you want to center and subtract that value from the variable in a data step. This is simple if you only need to center a few variables. Note that the means typed into the data step below are rounded to two decimal places, so the centered values are approximate.
proc means data = test mean;
  var score1 score2;
run;

Variable          Mean
------------------------
score1      43.9333333
score2      34.5333333
------------------------

data grand;
  set test;
  grmscore1 = score1 - 43.93;
  grmscore2 = score2 - 34.53;
run;

proc print data = grand;
run;

Obs  studentid  class  score1  score2  grmscore1  grmscore2
  1          1      1      34      24      -9.93     -10.53
  2          2      1      39      25      -4.93      -9.53
  3          3      1      34      26      -9.93      -8.53
  4          4      1      38      20      -5.93     -14.53
  5          5      1      32      21     -11.93     -13.53
  6          1      2      45      36       1.07       1.47
  7          2      2      43      30      -0.93      -4.53
  8          3      2      48      39       4.07       4.47
  9          4      2      41      37      -2.93       2.47
 10          5      2      40      31      -3.93      -3.53
 11          1      3      50      46       6.07      11.47
 12          2      3      51      49       7.07      14.47
 13          3      3      57      48      13.07      13.47
 14          4      3      50      40       6.07       5.47
 15          5      3      57      46      13.07      11.47
A second way to create a grand mean centered variable is to use proc means, output the means to a data set, and then merge that data set with your original data set. The data set output by proc means is shown below. As you can see, it has only one observation. The other thing to notice about this data set is that it has no variables in common with the original data set, which makes merging it with the original data set somewhat more difficult. The steps needed to overcome this problem are explained just above the data step that performs the merge.
proc means data = test mean;
  var score1 score2;
  output out = grand1 mean = m1 m2;
run;

proc print data = grand1;
run;

Obs  _TYPE_  _FREQ_       m1       m2
  1       0      15  43.9333  34.5333

proc sort data = test;
  by studentid class;
run;
If you try to merge the grand1 data set and the original test data set as you normally would, you will find that you have the values of m1 and m2 only for the first case, and missing values for the remaining 14 cases. Hence, we use an if-then statement with a do group to copy the values of m1 and m2 into new variables, which we have called mean1 and mean2, on the first iteration of the data step. We also use the retain statement so that the values of mean1 and mean2 are not reset to missing on the second and later iterations. Retaining m1 and m2 themselves would not help, because a merge without a by statement sets the variables from grand1 to missing once that data set runs out of observations. We use the drop statement to drop the variables m1 and m2, as well as the _type_ and _freq_ variables that were in the grand1 data set. Finally, we calculate the grand mean centered variables that we want, grmscore1 and grmscore2.
data grand1merged;
  merge test grand1;
  retain mean1 mean2;
  if _n_ = 1 then do;
    mean1 = m1;
    mean2 = m2;
  end;
  drop _freq_ _type_ m1 m2;
  grmscore1 = score1 - mean1;
  grmscore2 = score2 - mean2;
run;

proc print data = grand1merged;
run;

Obs  studentid  class  score1  score2    mean1    mean2  grmscore1  grmscore2
  1          1      1      34      24  43.9333  34.5333    -9.9333   -10.5333
  2          1      2      45      36  43.9333  34.5333     1.0667     1.4667
  3          1      3      50      46  43.9333  34.5333     6.0667    11.4667
  4          2      1      39      25  43.9333  34.5333    -4.9333    -9.5333
  5          2      2      43      30  43.9333  34.5333    -0.9333    -4.5333
  6          2      3      51      49  43.9333  34.5333     7.0667    14.4667
  7          3      1      34      26  43.9333  34.5333    -9.9333    -8.5333
  8          3      2      48      39  43.9333  34.5333     4.0667     4.4667
  9          3      3      57      48  43.9333  34.5333    13.0667    13.4667
 10          4      1      38      20  43.9333  34.5333    -5.9333   -14.5333
 11          4      2      41      37  43.9333  34.5333    -2.9333     2.4667
 12          4      3      50      40  43.9333  34.5333     6.0667     5.4667
 13          5      1      32      21  43.9333  34.5333   -11.9333   -13.5333
 14          5      2      40      31  43.9333  34.5333    -3.9333    -3.5333
 15          5      3      57      46  43.9333  34.5333    13.0667    11.4667
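The same result can be sketched more compactly with a second set statement that runs only on the first iteration of the data step. Variables read by a set statement are retained automatically, so m1 and m2 keep their values for all 15 observations, and neither a retain statement nor the copies mean1 and mean2 are needed. The data set name grand1set is made up for this illustration, and the sketch assumes that grand1 has already been created by the proc means step shown earlier.

```sas
* Alternative sketch: read the one-row grand1 data set only once;
data grand1set;
  set test;
  * On the first iteration, bring in the grand means, which are then retained;
  if _n_ = 1 then set grand1 (keep = m1 m2);
  grmscore1 = score1 - m1;
  grmscore2 = score2 - m2;
run;

proc print data = grand1set;
run;
```

Unlike grand1merged, this data set keeps the means under their original names, m1 and m2. Add a drop statement if you do not want them in the result.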
A third way to create a grand mean centered variable is to use proc sql. In the code below, four new variables are created: mean1 is the mean of score1, mean2 is the mean of score2, grandmc1 is the grand mean centered variable for score1 and grandmc2 is the grand mean centered variable for score2.
* grand mean centering using proc sql;
proc sql;
  create table grndmc as
  select *,
         mean(score1) as mean1,
         mean(score2) as mean2,
         score1 - mean(score1) as grandmc1,
         score2 - mean(score2) as grandmc2
  from test;
quit;

proc print data = grndmc;
run;

Obs  studentid  class  score1  score2    mean1    mean2  grandmc1  grandmc2
  1          1      1      34      24  43.9333  34.5333   -9.9333  -10.5333
  2          1      2      45      36  43.9333  34.5333    1.0667    1.4667
  3          1      3      50      46  43.9333  34.5333    6.0667   11.4667
  4          2      1      39      25  43.9333  34.5333   -4.9333   -9.5333
  5          2      2      43      30  43.9333  34.5333   -0.9333   -4.5333
  6          2      3      51      49  43.9333  34.5333    7.0667   14.4667
  7          3      1      34      26  43.9333  34.5333   -9.9333   -8.5333
  8          3      2      48      39  43.9333  34.5333    4.0667    4.4667
  9          3      3      57      48  43.9333  34.5333   13.0667   13.4667
 10          4      1      38      20  43.9333  34.5333   -5.9333  -14.5333
 11          4      2      41      37  43.9333  34.5333   -2.9333    2.4667
 12          4      3      50      40  43.9333  34.5333    6.0667    5.4667
 13          5      1      32      21  43.9333  34.5333  -11.9333  -13.5333
 14          5      2      40      31  43.9333  34.5333   -3.9333   -3.5333
 15          5      3      57      46  43.9333  34.5333   13.0667   11.4667
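Beyond the approaches above, proc standard can also be used for grand mean centering. The sketch below is not part of the original examples. It assumes you want to keep score1 and score2 unchanged, so it first copies them into new variables and then lets proc standard shift the copies to a mean of 0. The data set names testcopy and grandstd are made up for this illustration.

```sas
* Copy the scores so the originals are kept alongside the centered versions;
data testcopy;
  set test;
  grmscore1 = score1;
  grmscore2 = score2;
run;

* mean = 0 shifts each listed variable so that its mean is 0.
  Because no std = option is given, the spread is left unchanged;
proc standard data = testcopy mean = 0 out = grandstd;
  var grmscore1 grmscore2;
run;

* Check: the means of the centered variables should be 0 up to rounding;
proc means data = grandstd mean;
  var grmscore1 grmscore2;
run;
```

Because the full-precision mean is subtracted, the centered values should agree with those from the merge and proc sql approaches rather than with the rounded values from the first approach.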
3. Creating an aggregate variable
There may be times when you want to create an aggregate variable. An aggregate variable is one that aggregates data from a "lower level" to a "higher level". In this example, the students’ test scores (which can be thought of as a level 1 variable) are aggregated to the classroom level (which can be thought of as a level 2 variable). Hence, a new variable is created that is the mean of the test scores for each class.
In the code below, the output statement is used to output the means for each variable (in this case, score1 and score2) to a new data set called aggtest. The means for score1 are put into a variable called m1 and the means for score2 are put into a variable called m2.
* The by statement requires test to be sorted by class;
proc sort data = test;
  by class;
run;

proc means data = test mean;
  var score1 score2;
  by class;
  output out = aggtest mean = m1 m2;
run;

proc print data = aggtest;
run;

Obs  class  _TYPE_  _FREQ_    m1    m2
  1      1       0       5  35.4  23.2
  2      2       0       5  43.4  34.6
  3      3       0       5  53.0  45.8

* Attach the class means to every student in that class;
data merged;
  merge test aggtest;
  by class;
  drop _TYPE_ _FREQ_;
run;

proc print data = merged;
run;

Obs  studentid  class  score1  score2    m1    m2
  1          1      1      34      24  35.4  23.2
  2          2      1      39      25  35.4  23.2
  3          3      1      34      26  35.4  23.2
  4          4      1      38      20  35.4  23.2
  5          5      1      32      21  35.4  23.2
  6          1      2      45      36  43.4  34.6
  7          2      2      43      30  43.4  34.6
  8          3      2      48      39  43.4  34.6
  9          4      2      41      37  43.4  34.6
 10          5      2      40      31  43.4  34.6
 11          1      3      50      46  53.0  45.8
 12          2      3      51      49  53.0  45.8
 13          3      3      57      48  53.0  45.8
 14          4      3      50      40  53.0  45.8
 15          5      3      57      46  53.0  45.8
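If sorting the input just to compute the class means is inconvenient, a class statement can be used in place of the by statement. The sketch below is an illustration, not part of the original examples. The nway option keeps only the per-class rows (without it, the output data set would also contain a row for the overall means), and noprint suppresses the printed summary. The data set name aggclass is made up here.

```sas
* Sketch: the class statement groups the data without requiring a sort;
proc means data = test noprint nway;
  class class;
  var score1 score2;
  * Drop the bookkeeping variables right away with a data set option;
  output out = aggclass (drop = _type_ _freq_) mean = m1 m2;
run;

proc print data = aggclass;
run;
```

Merging aggclass back onto test with a by statement would still require test to be sorted by class.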
You can do the same thing using proc sql. In the code below, a data set called aggtestsql is created. In the select clause, the mean of score1 is computed and stored in a variable called mean1, and the mean of score2 is computed and stored in a variable called mean2. The group by clause is needed so that the means are computed by group, in this case by the variable class. If this clause were omitted, the means created would be grand means (in other words, means for the whole variable, not broken out by class).
proc sql;
  create table aggtestsql as
  select *, mean(score1) as mean1, mean(score2) as mean2
  from test
  group by class;
quit;

proc print data = aggtestsql;
run;

Obs  studentid  class  score1  score2  mean1  mean2
  1          1      1      34      24   35.4   23.2
  2          2      1      39      25   35.4   23.2
  3          3      1      34      26   35.4   23.2
  4          4      1      38      20   35.4   23.2
  5          5      1      32      21   35.4   23.2
  6          1      2      45      36   43.4   34.6
  7          2      2      43      30   43.4   34.6
  8          3      2      48      39   43.4   34.6
  9          4      2      41      37   43.4   34.6
 10          5      2      40      31   43.4   34.6
 11          1      3      50      46   53.0   45.8
 12          2      3      51      49   53.0   45.8
 13          3      3      57      48   53.0   45.8
 14          4      3      50      40   53.0   45.8
 15          5      3      57      46   53.0   45.8
4. Group mean centering
Just as there are at least three ways to create a grand mean centered variable, there are at least three different ways to create a group mean centered variable. The first way illustrated below is very straightforward, but it may be impractical if you have lots of groups (or classes). To save space, we have group mean centered only one variable, score1.
proc means data = test mean;
  by class;
  var score1;
run;

class=1

The MEANS Procedure

Analysis Variable : score1

        Mean
------------
  35.4000000
------------

class=2

Analysis Variable : score1

        Mean
------------
  43.4000000
------------

class=3

Analysis Variable : score1

        Mean
------------
  53.0000000
------------

data group;
  set test;
  if class = 1 then grpmscore1 = score1 - 35.4;
  if class = 2 then grpmscore1 = score1 - 43.4;
  if class = 3 then grpmscore1 = score1 - 53.0;
run;

proc print data = group;
run;

Obs  studentid  class  score1  score2  grpmscore1
  1          1      1      34      24        -1.4
  2          1      2      45      36         1.6
  3          1      3      50      46        -3.0
  4          2      1      39      25         3.6
  5          2      2      43      30        -0.4
  6          2      3      51      49        -2.0
  7          3      1      34      26        -1.4
  8          3      2      48      39         4.6
  9          3      3      57      48         4.0
 10          4      1      38      20         2.6
 11          4      2      41      37        -2.4
 12          4      3      50      40        -3.0
 13          5      1      32      21        -3.4
 14          5      2      40      31        -3.4
 15          5      3      57      46         4.0
A second way to create a group mean centered variable is to use proc means, output the means to a data set, and then merge that data set with your original data set. This is shown below.
* Sort first: both the by statement in proc means and the merge need test sorted by class;
proc sort data = test;
  by class studentid;
run;

proc means data = test mean;
  var score1 score2;
  by class;
  output out = grpmeanctr mean = m1 m2;
run;

data merged2;
  merge test grpmeanctr;
  by class;
  drop _TYPE_ _FREQ_;
  groupmc1 = score1 - m1;
  groupmc2 = score2 - m2;
run;

proc print data = merged2;
run;

Obs  studentid  class  score1  score2    m1    m2  groupmc1  groupmc2
  1          1      1      34      24  35.4  23.2      -1.4       0.8
  2          2      1      39      25  35.4  23.2       3.6       1.8
  3          3      1      34      26  35.4  23.2      -1.4       2.8
  4          4      1      38      20  35.4  23.2       2.6      -3.2
  5          5      1      32      21  35.4  23.2      -3.4      -2.2
  6          1      2      45      36  43.4  34.6       1.6       1.4
  7          2      2      43      30  43.4  34.6      -0.4      -4.6
  8          3      2      48      39  43.4  34.6       4.6       4.4
  9          4      2      41      37  43.4  34.6      -2.4       2.4
 10          5      2      40      31  43.4  34.6      -3.4      -3.6
 11          1      3      50      46  53.0  45.8      -3.0       0.2
 12          2      3      51      49  53.0  45.8      -2.0       3.2
 13          3      3      57      48  53.0  45.8       4.0       2.2
 14          4      3      50      40  53.0  45.8      -3.0      -5.8
 15          5      3      57      46  53.0  45.8       4.0       0.2
A third way to accomplish the same thing is to use proc sql. As before, four new variables are being created. You do not have to create the mean1 and mean2 variables; we have included them only for the sake of completeness and to show how this would be done.
proc sql;
  create table grpmeanctrsql as
  select *,
         mean(score1) as mean1,
         mean(score2) as mean2,
         score1 - mean(score1) as groupmc1,
         score2 - mean(score2) as groupmc2
  from test
  group by class;
quit;

proc print data = grpmeanctrsql;
run;

Obs  studentid  class  score1  score2  mean1  mean2  groupmc1  groupmc2
  1          1      1      34      24   35.4   23.2      -1.4       0.8
  2          2      1      39      25   35.4   23.2       3.6       1.8
  3          3      1      34      26   35.4   23.2      -1.4       2.8
  4          4      1      38      20   35.4   23.2       2.6      -3.2
  5          5      1      32      21   35.4   23.2      -3.4      -2.2
  6          1      2      45      36   43.4   34.6       1.6       1.4
  7          2      2      43      30   43.4   34.6      -0.4      -4.6
  8          3      2      48      39   43.4   34.6       4.6       4.4
  9          4      2      41      37   43.4   34.6      -2.4       2.4
 10          5      2      40      31   43.4   34.6      -3.4      -3.6
 11          1      3      50      46   53.0   45.8      -3.0       0.2
 12          2      3      51      49   53.0   45.8      -2.0       3.2
 13          3      3      57      48   53.0   45.8       4.0       2.2
 14          4      3      50      40   53.0   45.8      -3.0      -5.8
 15          5      3      57      46   53.0   45.8       4.0       0.2
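Group mean centering can also be sketched with proc standard by adding a by statement, so that each class is shifted to a mean of 0 separately. This is an illustration rather than one of the three approaches described above. The data set names testsorted, testcopy2 and grpstd are made up, and the scores are copied into new variables first so that score1 and score2 are left unchanged.

```sas
* The by statement in proc standard needs the data sorted by class;
proc sort data = test out = testsorted;
  by class studentid;
run;

* Copy the scores so the originals are kept alongside the centered versions;
data testcopy2;
  set testsorted;
  groupmc1 = score1;
  groupmc2 = score2;
run;

* Within each class, shift the copies so that their class mean is 0;
proc standard data = testcopy2 mean = 0 out = grpstd;
  by class;
  var groupmc1 groupmc2;
run;

* Check: the mean of each centered variable should be 0 within every class, up to rounding;
proc means data = grpstd mean;
  class class;
  var groupmc1 groupmc2;
run;
```

Using out = on proc sort leaves the order of test itself untouched, so the earlier examples are not affected by this step.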