How can I create lag and lead variables in longitudinal data?

When looking at data across consistent units of time (years, quarters, months), there is often interest in creating variables based on how data for a given time period compares to the periods before and after. If you have longitudinal data, you wish to look across units of time within a single subject. When your data is in long form (one observation per time point per subject), this can easily be handled in Stata with standard variable creation steps because of the way in which Stata processes datasets: it stores the entire dataset and can easily refer to any point in the dataset when generating variables. SAS works differently. SAS variables are typically created through a data step in which SAS moves through the dataset, observation by observation, carrying out the calculations for the given observation and accessing only one observation at a time. This system of data storage and access makes it possible for SAS to analyze large datasets but also very difficult to create time series variables in SAS using a data step. However, proc expand provides an easy-to-use alternative to the data step.

Let’s start with an example dataset containing only one subject. The dataset below contains US unemployment rates from September, 2006 to August, 2008.

data unemp; 
  input year month rate @@; 
  date = mdy( month, 1 , year ); 
  format date yymm.; 
  datalines; 
2006	09	4.5 2006	10	4.4
2006	11	4.5 2006	12	4.4
2007	01	4.6 2007	02	4.5
2007	03	4.4 2007	04	4.5
2007	05	4.5 2007	06	4.6
2007	07	4.7 2007	08	4.7
2007	09	4.7 2007	10	4.8
2007	11	4.7 2007	12	5
2008	01	4.9 2008	02	4.8
2008	03	5.1 2008	04	5
2008	05	5.5 2008	06	5.5
2008	07	5.7 2008	08	6.1
;

proc print data = unemp (obs = 5); run;

 Obs    year    month    rate     date
   1    2006       9      4.5    2006M09
   2    2006      10      4.4    2006M10
   3    2006      11      4.5    2006M11
   4    2006      12      4.4    2006M12
   5    2007       1      4.6    2007M01

For each month, we wish to know the difference between its rate and the rate of the previous month (r(i) – r(i-1)), its rate and the rate of the next month (r(i+1) – r(i)), and these two differences ((r(i+1)-r(i))-(r(i)-r(i-1)). To do this, we will use proc expand to generate a new dataset including these variables. In the proc expand line, we will name the new dataset unemp_laglead. We indicate that we do not wish to transform the values (using a spline, for example) but simply to grab the untransformed data from the specified record. We indicate that our time series is defined by date in the id line and in the three convert lines, we create the three values we wish to have for each time point in our data: the rate, the previous rate (rate_lag1), and the next rate (rate_lead1). In each line, we tell SAS the name of the variable in our new dataset, the type of transformation (lag, lead) and the number of time points to look back or ahead for the transformation (1 in this example).

proc expand data=unemp out=unemp_laglead method = none; 
  id date; 
  convert rate = rate_lag1   / transformout=(lag 1); 
  convert rate; 
  convert rate = rate_lead1  / transformout=(lead 1); 
run;

We can see the resulting dataset.

proc print data = unemp_laglead (obs = 5); run;

                  rate_            rate_
Obs     date       lag1    rate    lead1    year    mont
  1    2006M09      .       4.5     4.4     2006       9
  2    2006M10     4.5      4.4     4.5     2006      10
  3    2006M11     4.4      4.5     4.4     2006      11
  4    2006M12     4.5      4.4     4.6     2006      12
  5    2007M01     4.4      4.6     4.5     2007       1

Based on this dataset, we can now easily calculate the three time series variables we described earlier. But what if we had data for multiple countries? The dataset below contains unemployment data from 2000-2005 for three countries.

data unemp_international; 
  input country $ year rate @@;
  datalines;
US	2000	4	Canada	2000	6.1	UK	2000	5.5
US	2001	4.7	Canada	2001	6.5	UK	2001	5.1
US	2002	5.8	Canada	2002	7	UK	2002	5.2
US	2003	6	Canada	2003	6.9	UK	2003	5
US	2004	5.5	Canada	2004	6.4	UK	2004	4.8
US	2005	5.1	Canada	2005	6	UK	2005	4.9
;

proc print data = unemp_international (obs = 5); run;

Obs    country    year    rate
  1    US         2000     4.0
  2    Canada     2000     6.1
  3    UK         2000     5.5
  4    US         2001     4.7
  5    Canada     2001     6.5

We wish to create lag and lead variables within each country. To do this, we can use proc expand with a by statement after sorting on country.

proc sort data = unemp_international;
  by country;
run;

proc expand data=unemp_international out=unemp_int2 method = none; 
  by country;
  id year; 
  convert rate = rate_lag1   / transformout=(lag 1); 
  convert rate; 
  convert rate = rate_lead1  / transformout=(lead 1); 
run; 

proc print data = unemp_int2; run;

                            rate_            rate_
  Obs    country    year     lag1    rate    lead1
    1    Canada     2000      .       6.1     6.5
    2    Canada     2001     6.1      6.5     7.0
    3    Canada     2002     6.5      7.0     6.9
    4    Canada     2003     7.0      6.9     6.4
    5    Canada     2004     6.9      6.4     6.0
    6    Canada     2005     6.4      6.0      .
    7    UK         2000      .       5.5     5.1
    8    UK         2001     5.5      5.1     5.2
    9    UK         2002     5.1      5.2     5.0
   10    UK         2003     5.2      5.0     4.8
   11    UK         2004     5.0      4.8     4.9
   12    UK         2005     4.8      4.9      .
   13    US         2000      .       4.0     4.7
   14    US         2001     4.0      4.7     5.8
   15    US         2002     4.7      5.8     6.0
   16    US         2003     5.8      6.0     5.5
   17    US         2004     6.0      5.5     5.1
   18    US         2005     5.5      5.1      .

With proc expand, you can also generate moving averages, splines, and interpolated values. For more details, see the proc expand pages of the SAS Online Documentation.